ABSTRACT


When  the  CellBE  processor  was  introduced,  the  Advanced  Encryption  Standard 
independently. For ECB encryption our version is slightly faster than that of IBM; for
CBC  encryption  our  version  is  significantly  faster.  This  paper  describes  our  development 
process  and  design  tradeoffs,  with  emphasis  on  lessons  learned.  This  could  be  useful  for 
anyone  wishing  to  develop  high-speed  applications  on  the  CellBE. 
1 Introduction


Our  team  has  implemented  authenticated  encryption,  using  Galois  Counter  Mode  (GCM)[6,  11],  on  the  Cell 
Broadband  Engine  (CellBE)  processor[3].  An  essential  part  of  GCM  is  a  block  cipher,  here  the  Advanced 
Encryption  Standard  (AES) [8].  This  paper  details  the  process  through  which  we  developed  AES  on  the 
CellBE,  and  were  able  to  match  and  even  surpass  the  speed  benchmarks  set  by  IBM[1]. 

1.1  CellBE  Processor 

The Cell Broadband Engine (CellBE) processor architecture was designed jointly by Sony, Toshiba, and
IBM as a versatile multi-processor suitable for a wide variety of applications [3]. It is best known as the
processor inside the PlayStation 3, which has been very successful.

The  currently  available  CellBE  chip  includes  a  main  PowerPC  Processor  Element  (PPE)  along  with 
eight  “Synergistic  Processor  Elements”  (SPEs).  The  intent  is  that  the  PowerPC  processor  should  run  the 
operating  system  and  farm  out  all  the  computationally  intensive  tasks  to  the  SPEs. 

The  SPEs  have  a  different  instruction  set  using  Single  Instruction  Multiple  Data  (SIMD)  parallelism, 
with  128  registers,  each  128  bits  wide [4].  Each  SPE  includes  a  Synergistic  Processor  Unit  (SPU,  the  central 
processor),  Local  Store  memory  (LS,  256  KB),  and  a  Memory  Flow  Controller  (MFC)  that  handles  DMA 
to/from  the  LS.  The  SPU  has  two  instruction  pipelines,  called  even  and  odd,  each  of  which  handles  specific 
instruction  types.  That  is,  any  particular  instruction  is  either  even  type  (e.g.  xor)  or  odd  type  (e.g.  load). 

One  application  area  used  to  demonstrate  the  capabilities  of  this  new  processor  was  cryptography.  In 
particular,  IBM  published  speeds  for  the  Advanced  Encryption  Standard  (AES),  given  in  terms  of  throughput 
for  a  single  SPE.  Unfortunately,  IBM  did  not  publish  its  code. 

1.2  Advanced  Encryption  Standard 

The  Advanced  Encryption  Standard  (AES)  was  specified  in  2001  by  the  National  Institute  of  Standards  and 
Technology [8].  The  purpose  is  to  provide  a  standard  algorithm  for  encryption,  strong  enough  to  keep  U.S. 
government  documents  secure  for  at  least  the  next  20  years.  The  earlier  Data  Encryption  Standard  (DES) 
had  been  rendered  insecure  by  advances  in  computing  power,  and  was  effectively  replaced  by  triple-DES. 
Now  AES  will  largely  replace  triple-DES  for  government  use,  and  has  become  widely  adopted  internationally 
for  a  variety  of  encryption  needs,  such  as  secure  transactions  via  the  Internet. 

The  AES  algorithm,  previously  called  the  Rijndael  algorithm[2],  is  a  symmetric  encryption  algorithm, 
meaning  encryption  and  decryption  are  performed  by  essentially  the  same  steps.  It  is  a  block  cipher,  where 
the  data  is  encrypted/decrypted  in  blocks  of  128  bits.  (The  original  Rijndael  algorithm  allows  other  block 
sizes,  but  the  Standard  only  permits  128-bit  blocks.)  Each  data  block  is  modified  by  several  “rounds”  of 
processing,  where  each  round  involves  four  steps.  Three  different  key  sizes  are  allowed:  128  bits,  192  bits, 
or  256  bits,  and  the  corresponding  number  of  rounds  for  each  is  10  rounds,  12  rounds,  and  14  rounds.  From 
the  original  key,  a  different  “round  key”  is  computed  for  each  of  these  rounds. 

There  are  several  different  modes  in  which  AES  can  be  used  [7].  For  some  of  these,  such  as  Cipher  Block 
Chaining  (CBC),  the  result  of  encrypting  one  block  is  used  in  encrypting  the  next.  These  are  called  feedback 
modes,  and  the  feedback  effectively  precludes  processing  several  blocks  in  parallel.  Other  modes,  such  as  the 
“Electronic  Code  Book”  mode  and  “Counter”  modes,  do  not  require  feedback.  These  non-feedback  modes 
may  be  parallelized  for  greater  throughput. 

Here  we  give  a  brief  description  of  the  algorithm,  to  indicate  the  computations  involved.  The  four  steps  in 
each round of encryption, in order, are called [8] SubBytes (byte substitution), ShiftRows, MixColumns, and
AddRoundKey.  Before  the  first  round,  the  input  block  is  processed  by  AddRoundKey  (one  could  consider 
this  round  number  zero).  Also,  the  last  round  skips  the  MixColumns  step.  Otherwise,  all  rounds  are  the 
same,  except  each  uses  a  different  round  key,  and  the  output  of  one  round  becomes  the  input  for  the  next. 
(For  decryption,  the  mathematical  inverse  of  each  step  is  used,  in  reverse  order;  certain  manipulations  allow 
this  to  appear  like  the  same  steps  as  encryption  with  certain  constants  changed.) 

The  single  nonlinear  step  is  the  SubBytes  (byte  substitution)  step,  where  each  byte  (8  bits)  of  the  input 
is  replaced  by  the  result  of  applying  the  “S-box”  function  to  that  byte.  This  nonlinear  function  involves 
finding the inverse of the 8-bit number, considered as an element of the Galois field GF($2^8$). This is not a
simple calculation, and so AES implementations typically use a precomputed S-box table, where the input
byte  is  an  index  into  the  table  to  find  the  output.  This  table  look-up  method  is  fast,  easy  to  implement,  and 
only  requires  256  bytes. 
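As a rough illustration of why the table is precomputed, the following Python sketch builds the 256-byte S-box from the field inverse. The affine transformation applied after the inverse is taken from the AES standard [8] rather than from the description above, and the function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return product


def gf_inverse(a):
    """Multiplicative inverse computed as a^254; 0 maps to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_entry(a):
    """Field inverse followed by the affine map of the AES standard."""
    inv = gf_inverse(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


# The whole S-box fits in 256 bytes, so a lookup replaces all of the work above.
SBOX = bytes(sbox_entry(a) for a in range(256))
```

Each entry costs hundreds of field operations to compute this way, whereas `SBOX[a]` is a single indexed load.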

The other three steps (ShiftRows, MixColumns, and AddRoundKey) are linear, in the sense that the
output  128-bit  block  for  such  steps  is  just  the  linear  combination  (bitwise,  modulo  2)  of  the  outputs  for  each 
separate  input  bit. 

The  ShiftRows  step  considers  the  current  128-bit  state  as  a  4  x  4  matrix  of  bytes  (ordered  as  4  columns). 
This  step  rotates  each  row  of  bytes  left  by  the  row  index  (0-3);  it  just  moves  bytes  around. 
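A minimal sketch of ShiftRows follows, assuming the state is stored as 16 bytes in column order (byte index = 4 * column + row), which matches the "4 columns" ordering described above. The function name is hypothetical.

```python
def shift_rows(state):
    """Rotate row r of the 4 x 4 byte matrix left by r positions.

    The state is 16 bytes in column order: index = 4 * column + row.
    """
    if len(state) != 16:
        raise ValueError("AES state must be exactly 16 bytes")
    out = bytearray(16)
    for col in range(4):
        for row in range(4):
            # The byte that lands in (row, col) comes from column col + row.
            out[4 * col + row] = state[4 * ((col + row) % 4) + row]
    return bytes(out)


example = shift_rows(bytes(range(16)))
```

Because the step is a fixed permutation of byte positions, it can be expressed as a single byte shuffle or folded into the indexing of a later step.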

The  MixColumns  step  considers  the  state  as  4  columns  of  4  bytes  each,  and  multiplies  each  column  by  a 
constant  matrix: 
$$\begin{pmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{pmatrix}$$

where byte multiplication and addition use the Galois arithmetic of GF($2^8$). In this field, each byte can
be considered the coefficient vector of a polynomial of (formal) degree 7: $a = a_7 x^7 + \cdots + a_1 x + a_0$, where
each coefficient $a_i$ is a bit. Addition (mod 2) is then bitwise XOR. Multiplication is polynomial
multiplication, modulo the irreducible polynomial $x^8 + x^4 + x^3 + x + 1$. Then in the matrix above, '2' (00000010)
means the polynomial $x$, and $2 \times a = a_7 x^8 + \cdots + a_1 x^2 + a_0 x$, but modulo $x^8 + x^4 + x^3 + x + 1$, giving
(a << 1) ^ (a7 * 0x11B) in C notation. And $3 \times a = a \oplus (2 \times a)$. So MixColumns really only
requires Galois multiplication by 2.
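The multiplication by 2 and the column product can be sketched in Python as follows. This is an illustrative scalar version with hypothetical function names, not the SPU code; it uses the standard MixColumns matrix with first row (2, 3, 1, 1).

```python
def xtime(a):
    """Galois multiplication of a byte by 2, as in (a << 1) ^ (a7 * 0x11B)."""
    a7 = a >> 7  # leading bit of the byte
    return ((a << 1) ^ (a7 * 0x11B)) & 0xFF


def mul3(a):
    """Multiplication by 3 is a XOR (2 x a)."""
    return a ^ xtime(a)


def mix_column(col):
    """Multiply one 4-byte column by the constant MixColumns matrix."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

Every entry of the matrix is 1, 2, or 3, so `xtime` and XOR are the only operations needed.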

The  inverse  MixColumns  operation  uses  the  inverse  of  the  above  matrix  (shown  below  in  hexadecimal): 
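$$\begin{pmatrix} \mathtt{0E} & \mathtt{0B} & \mathtt{0D} & \mathtt{09} \\ \mathtt{09} & \mathtt{0E} & \mathtt{0B} & \mathtt{0D} \\ \mathtt{0D} & \mathtt{09} & \mathtt{0E} & \mathtt{0B} \\ \mathtt{0B} & \mathtt{0D} & \mathtt{09} & \mathtt{0E} \end{pmatrix}$$

This is a bit more complicated, since it requires multiplication by 2, by 4, and by 8 (or repeated multiplication
by 2).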
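To make the extra cost concrete, here is a sketch that multiplies a column by either matrix using only repeated multiplication by 2. Both matrices are circulant, so each is described by its first row; the helper names are hypothetical.

```python
def xtime(a):
    """Galois multiplication of a byte by 2 modulo x^8 + x^4 + x^3 + x + 1."""
    return ((a << 1) ^ ((a >> 7) * 0x11B)) & 0xFF


def mul_const(c, a):
    """Multiply byte a by a 4-bit constant c using repeated doubling."""
    result = 0
    power = a  # holds a, 2a, 4a, 8a in turn
    for bit in range(4):
        if (c >> bit) & 1:
            result ^= power
        power = xtime(power)
    return result


FORWARD_ROW = (0x02, 0x03, 0x01, 0x01)
INVERSE_ROW = (0x0E, 0x0B, 0x0D, 0x09)


def mix(col, first_row):
    """Multiply a column by the circulant matrix with the given first row."""
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            acc ^= mul_const(first_row[(j - i) % 4], col[j])
        out.append(acc)
    return out


restored = mix([0x8E, 0x4D, 0xA1, 0xBC], INVERSE_ROW)
```

The forward constants need at most one doubling per byte, while the inverse constants 09, 0B, 0D, and 0E each need three.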

These  Galois  multiplications  may  be  replaced  by  table  look-ups,  and  these  table  lookups  can  be  combined 
with  those  for  the  SubBytes  (as  suggested  by  the  developers  of  Rijndael[2]).  That  is,  ShiftRows  can  be  done 
first  in  each  round  (just  a  matter  of  indexing  correctly),  then  for  each  byte  in  a  column,  SubBytes  and 
MixColumns together require one table lookup of a 4-byte column, and those 4 columns are added (XOR) to give
the output column. This approach requires 4 tables (a different table for each byte row position), each of
256 columns, for a total of 4 KB of storage. All the fastest general software implementations of AES use this
approach,  which  has  been  called  the  T-table  approach. 
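The following sketch builds the four tables and checks that four lookups plus XORs reproduce SubBytes followed by MixColumns for one column. It assumes the input column has already been through ShiftRows; the names and the big-endian packing of a column into a 32-bit word are illustrative choices, not taken from any particular implementation.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return product


def build_sbox():
    """S-box of the AES standard: field inverse, then the affine map."""
    sbox = []
    for a in range(256):
        inv = 0
        for candidate in range(1, 256):
            if gf_mul(a, candidate) == 1:
                inv = candidate
                break
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()
MATRIX_ROWS = ((2, 3, 1, 1), (1, 2, 3, 1), (1, 1, 2, 3), (3, 1, 1, 2))
# Column r of the matrix is the contribution of the input byte in row r.
MATRIX_COLUMNS = tuple(zip(*MATRIX_ROWS))

# T_TABLES[r][a] is a 4-byte column packed into one 32-bit word.
T_TABLES = [
    [int.from_bytes(bytes(gf_mul(c, SBOX[a]) for c in column), "big")
     for a in range(256)]
    for column in MATRIX_COLUMNS
]


def round_column_ttable(col):
    """SubBytes + MixColumns for one column: four lookups and XORs."""
    word = 0
    for row, a in enumerate(col):
        word ^= T_TABLES[row][a]
    return word.to_bytes(4, "big")


def round_column_direct(col):
    """Reference: apply the S-box, then the matrix product."""
    s = [SBOX[a] for a in col]
    return bytes(
        gf_mul(r[0], s[0]) ^ gf_mul(r[1], s[1]) ^ gf_mul(r[2], s[2]) ^ gf_mul(r[3], s[3])
        for r in MATRIX_ROWS
    )


TABLE_BYTES = len(T_TABLES) * len(T_TABLES[0]) * 4
```

`TABLE_BYTES` evaluates to 4096, matching the 4 KB figure quoted above.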

Lastly,  the  AddRoundKey  step  is  merely  adding  (bitwise  XOR)  the  Round  Key  to  the  current  state. 

1.3  Analysis  of  IBM’s  Results 

As  one  of  the  benchmarks  for  the  CellBE  processor,  IBM  published  timing  results  for  their  implementations 
of AES [1]. These results are given for a single SPU processor in terms of throughput rates measured in
gigabits per second. They give results for each of the three key sizes, both ECB and CBC modes, both
encryption and decryption. We asked IBM for the code and were told that it would not be released.

We  analyzed  their  numbers,  based  on  a  simple  model  for  their  unknown  code.  We  assumed  their  code  was 
structurally  similar  to  ours,  having  an  inner  loop  for  each  round,  inside  an  outer  loop  for  each  block,  where 
the  block  loop  may  be  partially  unrolled  to  process  some  small  number  of  blocks  in  parallel  (for  non-feedback 
modes).  Table  1  shows  their  rates  and  our  loop  models  for  them. 

For each of the four modes (ECB/CBC, encrypt/decrypt) all we have to work with are three numbers.
But based on this model, the reciprocals (time per bit) should fall on a straight line. We chose the axis units
to be time in instruction clock cycles versus rounds per block. The slope of that line indicates the number
of clock cycles needed for each round of each block, inside the round loop. The total number of clock cycles
for one iteration of the round loop, processing some number b of blocks in parallel, must be an integer. So
the fractional part of the clocks/round/block should be a multiple of 1/b. The intercept of the line indicates
the extra clocks/block needed outside the round loop, that is, for the last round (and round 0); this number
also should be a multiple of 1/b. But if our model is wrong (say, if they fully unrolled the round loop) then
the points are unlikely to lie on such a line.

Table 1: IBM's published throughput rates (in Gigabits/sec for one SPU, from [1]) are shown, along with
our models of the loop structure of their code: we assume a small number of blocks is processed in parallel
('blks') inside the round loop, and give the clocks per round per block, as well as the extra clocks per block
for the last round (usually negative). The last column shows the maximum relative error in our modeled
rates.

              IBM's published results (Gbits/sec)     loop model
AES type      keysize                                 blks    clocks             max err
              128        192        256                       round     last
ECB encr.     2.059      1.710      1.462             4       20.25     -4       0.03%
CBC encr.     0.795      0.664      0.570             1       51        3        0.18%
ECB decr.     1.499      1.252      1.068             2       27.5      -3       0.21%
CBC decr.     1.507      1.249      1.066             4       28        -8.75    0.05%

The  published  rates  give  three  (or  a  bit  more)  significant  digits.  The  slopes  for  our  least-squares  fit  lines 
should  have  similar  precision,  but  the  intercepts  have  less  precision  (from  cancellation).  The  fractional  part 
of the slope only has about one significant digit, but we used that to guess the number b of blocks processed
in parallel. (For CBC encryption, the feedback requires that b = 1. For ECB decryption, the fraction was
0.5, consistent with either b = 2 or b = 4.)

Our  loop  models  agree  well  with  the  published  data.  For  ECB  encryption  and  CBC  decryption,  our 
models  reproduce  the  published  throughput  rates  almost  exactly.  For  ECB  decryption,  the  three  points  do 
not  fit  a  line  so  well  (the  rate  for  192-bit  keys  seems  relatively  high);  for  CBC  encryption,  the  points  make 
a  nice  line  but  the  slope  is  not  exactly  an  integer.  But  even  in  those  cases  our  models  only  give  a  small 
difference  in  the  least  significant  digits  of  the  rates,  with  a  relative  error  of  a  fraction  of  a  percent.  The 
accuracy  of  these  models  gives  strong  support  to  our  assumptions  about  the  structure  of  their  codes. 
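The fitting procedure can be sketched as follows, using the ECB encryption row of IBM's published rates. The SPU clock rate of 3.192 GHz is our assumption here (it is not stated in this section), and the function names are hypothetical.

```python
CLOCK_GHZ = 3.192   # assumed SPU clock rate; not given in the text above
BLOCK_BITS = 128
ROUNDS = {128: 10, 192: 12, 256: 14}


def cycles_per_block(rate_gbps):
    """Convert a throughput in Gbit/s into clock cycles per 128-bit block."""
    return BLOCK_BITS * CLOCK_GHZ / rate_gbps


def fit_loop_model(rates):
    """Least-squares line through (rounds, cycles per block).

    Returns (clocks per round per block, extra clocks per block).
    """
    xs = [ROUNDS[k] for k in sorted(rates)]
    ys = [cycles_per_block(rates[k]) for k in sorted(rates)]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    return slope, y_mean - slope * x_mean


def modeled_rate(keysize, clocks_per_round, extra_clocks):
    """Throughput in Gbit/s predicted by the loop model."""
    cycles = ROUNDS[keysize] * clocks_per_round + extra_clocks
    return BLOCK_BITS * CLOCK_GHZ / cycles


ecb_encrypt = {128: 2.059, 192: 1.710, 256: 1.462}
slope, intercept = fit_loop_model(ecb_encrypt)
```

Under the assumed clock rate, the fitted slope and intercept come out close to 20.25 and -4, which are then rounded to multiples of 1/b with b = 4.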

2  Code  Development 

Our  goal  was  to  implement  AES  on  an  SPE  and  optimize  for  speed.  In  particular,  we  needed  the  Counter 
Mode (CTR) of encryption, for incorporation into the authenticated Galois Counter Mode (GCM)[6]. In
Counter  Mode,  a  128-bit  counter  is  given  an  Initial  Value  (unique  IV  for  each  message  for  a  given  key). 
Then  for  each  plaintext  block,  the  counter  is  incremented  and  encrypted  using  AES  with  the  secret  Key;  the 
result  is  added  to  the  plaintext  (as  a  stream  cipher)  to  give  the  ciphertext  block.  Hence  decryption  in  CTR 
mode  is  exactly  the  same  process,  and  actual  AES  decryption  is  never  required.  (Later,  for  comparison  with 
IBM’s  results,  we  also  implemented  Electronic  Code  Book  [ECB]  encryption  and  Cipher  Block  Chaining 
[CBC]  encryption,  a  feedback  mode.) 
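The structure of Counter Mode can be sketched as below. Note that `toy_block_encrypt` is only a stand-in built from SHA-256 so that the example is self-contained; it is not AES and is not secure. The whole-counter increment is a simplification for illustration.

```python
import hashlib


def toy_block_encrypt(key, block):
    """Stand-in for AES encryption of one 16-byte block (NOT AES, illustration only)."""
    return hashlib.sha256(key + block).digest()[:16]


def ctr_crypt(key, iv, data, block_encrypt=toy_block_encrypt):
    """Encrypt or decrypt data in Counter Mode; both directions are identical."""
    counter = iv
    out = bytearray()
    for offset in range(0, len(data), 16):
        # Increment the 128-bit counter, then encrypt it to get the keystream.
        counter = (counter + 1) % (1 << 128)
        keystream = block_encrypt(key, counter.to_bytes(16, "big"))
        chunk = data[offset:offset + 16]
        # Adding (XOR) the keystream to the plaintext gives the ciphertext.
        out.extend(p ^ k for p, k in zip(chunk, keystream))
    return bytes(out)


key = b"0123456789abcdef"
message = b"Counter mode never needs the inverse cipher."
ciphertext = ctr_crypt(key, 42, message)
recovered = ctr_crypt(key, 42, ciphertext)
```

Since only `block_encrypt` is ever called, each counter block can be encrypted independently of the others, which is what allows several blocks to be processed in parallel.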

The  registers  in  the  SPU  are  128  bits  wide,  perfect  to  hold  the  current  state  in  the  AES  encryption. 
The SIMD instruction set includes operations on whole registers as a single “quadword”, or in parallel as
4 words (each 32 bits, one column of the AES state) or as 16 bytes (or even as 128 bits in parallel for such
operations as XOR). So we started by implementing the basic round steps with SIMD parallelism.

The  first  design  consideration  was  whether  or  not  to  use  T-tables.  The  IBM  Cell  Broadband  Engine 
Programming  Handbook[4,  24.6.2]  shows  how  to  do  16  table  lookups  in  parallel  using  the  shuffle  bytes 
command  (shufb),  and  specifically  uses  the  AES  SubBytes  step  as  an  example.  Briefly,  shufb  does  lookups 
of  bytes  from  tables  in  registers,  based  on  the  lowest  5  bits  of  the  index  byte;  then  each  higher  bit  is  used  to 
successively  select  (selb)  the  correct  result.  However,  the  T-table  approach  requires  using  bytes  to  look  up 
whole  words  (4-byte  columns)  rather  than  bytes.  Doing  this  in  parallel  using  shufb  is  infeasible  (not  enough 
registers)  and  anyway  would  be  much  less  efficient  than  doing  the  lookups  sequentially  from  tables  in  Local 
Store memory. We tried both approaches, parallel SIMD or serial T-tables, and discuss the comparisons
below.  Table  2  summarizes  the  different  versions  of  AES  we  developed,  and  shows  the  code  refinement 
process. 

2.1  SIMD  Code 

For  the  SIMD  approach,  an  entire  block  is  processed  in  parallel  parts  simultaneously,  including:  128  parallel 
bit  operations  for  AddRoundKey,  16  parallel  byte  operations  for  SubBytes,  4  parallel  word  operations  for 
MixColumns, and a single quadword operation for ShiftRows. This parallelism requires replacing any
instruction branching (based on data values) with selection operations. For example, in Galois multiplication
by  2  (for  MixColumns),  after  a  left  shift  we  add  the  modulo  constant  only  if  the  leading  bit  was  1;  for  SIMD 
we  compute  both  with  and  without  the  modulo  constant,  then  bytewise  choose  (by  selb)  the  correct  result 
using  a  selector  mask  based  on  the  leading  bit  of  each  byte. 
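The branch-free multiplication by 2 can be emulated in Python by treating a 128-bit integer as one register of 16 bytes. This is a sketch of the idea only: `selb` here mimics the bitwise select semantics (take bits from the second operand where the mask is 1), and the other names are hypothetical.

```python
FULL = (1 << 128) - 1


def splat(byte):
    """Replicate one byte into all 16 byte lanes of a 128-bit value."""
    return int.from_bytes(bytes([byte]) * 16, "big")


def selb(a, b, mask):
    """Bitwise select: bits of b where mask is 1, bits of a where mask is 0."""
    return (a & ~mask & FULL) | (b & mask)


def xtime16(x):
    """Galois multiplication by 2 on 16 bytes at once, with no branches."""
    shifted = (x << 1) & splat(0xFE)      # per-byte left shift, carries dropped
    reduced = shifted ^ splat(0x1B)       # result if the leading bit was 1
    leading = (x >> 7) & splat(0x01)      # leading bit of each byte, in bit 0
    mask = leading * 0xFF                 # 0xFF in lanes whose leading bit was 1
    return selb(shifted, reduced, mask)


def xtime_scalar(a):
    """Reference version with a data-dependent branch."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a


block = bytes(range(0x78, 0x88))
doubled = xtime16(int.from_bytes(block, "big")).to_bytes(16, "big")
```

Both candidate results are always computed; the mask merely picks the right one per byte, so the instruction stream is the same for all data.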

(Note:  The  SPU  Instruction  Set [5]  is  limited  since  instructions  are  32  bits  wide  and  7  bits  are  required 
to  specify  each  register  involved  [up  to  four],  so  relatively  few  operation  codes  are  available.  Consequently, 
some  instructions  one  might  expect  are  not  available.  In  particular,  there  are  no  instructions  to  rotate  or 
shift  bytes  [only  halfwords,  words,  and  quadwords],  which  would  be  handy  for  the  Galois  multiplication  by 
2.) 

Our  initial  SIMD  code  was  a  straightforward  implementation  of  the  steps  of  a  round,  in  a  loop  for  the 
rounds,  inside  a  loop  for  each  block  (encrypted  by  Counter  Mode).  The  SubBytes  step  was  the  most  expensive 
computationally,  MixColumns  roughly  half  as  expensive,  and  the  other  steps  just  one  or  two  instructions. 
We call this version CTR0, and its speed is about one-quarter that of the IBM benchmarks. (The closest
comparison  for  our  CTR  mode  is  IBM’s  ECB  mode.) 

The next version applied “instruction scheduling,” where we move instructions around (within the
limitations imposed by the algorithm). One goal here is to reduce or eliminate dependency stall, where an
instruction waits for the result of a previous one. The other goal of instruction scheduling is to begin two
instructions at once, one in each pipeline of the SPU; this is called dual-issue. This requires the two
instructions to be of the correct types, in the correct order, aligned with the correct address parity (even, odd),
with both instructions ready to commence: no waiting for earlier results. (Address alignment may be
adjusted by inserting no-operation commands: nop or lnop; this may also be done with the assembler .align
directive.) The ideal would be for all instructions to be dual-issued without any dependency stall, keeping
both pipelines running nonstop. But the algorithm determines which instructions are required, so typically
there are not equal numbers of instructions for each pipeline. Some operations may be achieved by different
choices of instructions, so sometimes instructions for one pipeline can effectively be replaced by instructions
for the other, to give a better balance for more dual-issues. Indeed, sometimes using more instructions to
get a result may take less time through more dual-issues.

Another  related  improvement  comes  from  providing  branch  hints  in  the  code.  (The  SPU  hardware  does 
not  automatically  predict  branches.)  Without  a  branch  hint,  the  SPU  “assumes”  that  a  branch  instruction 
will not branch (even an unconditional branch instruction!); if the branch is actually taken, then the
instruction queue must be flushed and refilled, with a penalty of 18 or 19 clock cycles, before execution resumes. A
branch  hint  instruction  predicts  whether  a  later  branch  instruction  will  branch  or  not.  (Only  a  single  branch 
hint  may  be  in  effect  at  any  time.)  If  the  hint  is  correct  and  given  early  enough,  then  the  hinted  branch 
takes  a  single  clock  cycle  and  execution  continues;  if  the  hint  was  incorrect  the  usual  branch  penalty  applies. 
So  efficiency  can  be  enhanced  by  eliminating  branches  where  feasible  (e.g.,  using  selection  operations  selb) 
or  correctly  hinting  branches. 

Instruction  scheduling  our  code  greatly  increased  the  amount  of  dual-issues  and  reduced  dependency 
stalls.  And  we  successfully  hinted  the  branches  for  both  the  inner  round  loop  and  the  outer  block  loop 
(except  the  last  iteration  of  each  loop  does  not  branch,  so  suffers  the  penalty).  These  techniques  nearly 
doubled  the  speed;  we  call  the  resulting  code  version  CTR1. 

Next  we  considered  loop  unrolling.  If  two  or  more  iterations  of  a  loop  can  be  done  together,  then 
interleaving  their  instructions  effectively  reduces  the  data  dependency  stalls;  the  interleaved  instructions  can 
take  advantage  of  what  would  otherwise  just  be  waiting  time.  (But  note  that  such  interleaving  may  have  little 
effect on dual-issue rates, as the balance of instructions between pipelines remains unchanged.) Furthermore,
fully  unrolling  a  loop,  where  feasible,  can  eliminate  branch  instructions  and  counter  increments. 
For AES, each round begins with the result of the previous round, so successive iterations of the round
loop  cannot  be  interleaved  this  way.  However,  for  non-feedback  encryption  modes,  such  as  CTR  or  ECB, 
the  encryption  of  each  block  is  independent  of  the  other  blocks.  So  the  block  loop  may  be  partially  unrolled 
to  interleave  instructions  for  two  or  more  blocks.  This  makes  the  code  more  complicated  and  also  requires 
using  more  registers  (several  for  each  block).  At  first  we  unrolled  to  do  two  blocks  at  once,  which  eliminated 
much of the dependency stall; this code is called CTR2. We later unrolled two more blocks, to process four
blocks at a time, eliminating all the remaining dependency stall; this we call CTR4a. But this was still not
as  fast  as  IBM’s  benchmark  ECB,  though  it  was  getting  close. 

The  next  improvement  came  from  rethinking  the  MixColumns  step.  (Two  versions  were  developed,  one  for 
feedback  modes  and  one  for  the  four-block  unrolled  loop,  because  they  had  different  optimizations  available.) 
One  xor  was  saved  by  reorganizing  the  algebraic  steps,  particularly  by  adding  rows  0  and  1  together  before 
doing  the  Galois  multiply  by  2.  And  the  scheduling  was  improved  by  combining  AddRoundKey  with  the 
additions  in  MixColumns.  Also,  the  dual-issue  rate  was  improved  by  replacing  some  even  pipeline  commands 
by  different  odd  pipeline  ones.  More  specifically,  some  roti  (rotate)  instructions  were  replaced  by  shufb 
instructions,  a  selb  (select)  became  two  shufb  instructions,  and  for  one  of  the  four  blocks,  a  comparison 
instruction  was  replaced  by  four  odd  pipeline  instructions. 
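The algebra behind adding two rows before the multiply can be checked with a short sketch. It relies only on the distributive law, 2 x a0 + 2 x a1 = 2 x (a0 + a1). The reorganized form shown is one common way to exploit this and is not necessarily the exact instruction sequence used in our code; the function names are hypothetical.

```python
def xtime(a):
    """Galois multiplication of a byte by 2 modulo x^8 + x^4 + x^3 + x + 1."""
    return ((a << 1) ^ ((a >> 7) * 0x11B)) & 0xFF


def mix_column_direct(a):
    """Row-by-row product with the matrix (2 3 1 1 / 1 2 3 1 / 1 1 2 3 / 3 1 1 2)."""
    a0, a1, a2, a3 = a
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def mix_column_reorganized(a):
    """Add adjacent rows first, so each output needs a single multiply by 2."""
    total = a[0] ^ a[1] ^ a[2] ^ a[3]
    return [a[i] ^ total ^ xtime(a[i] ^ a[(i + 1) % 4]) for i in range(4)]


sample = [0xDB, 0x13, 0x53, 0x45]
```

For output row 0, for example, a0 + total + 2 x (a0 + a1) expands to 2 x a0 + 3 x a1 + a2 + a3, which is the first row of the matrix product.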

Further instruction scheduling was applied in the four-block version, to take advantage of more
dual-issues. This included preparing for the next iteration of blocks while finishing the last round of the current
blocks,  and  interleaving  some  instructions  from  MixColumns  for  some  blocks  with  the  SubBytes  for  other 
blocks. 

Finally,  another  improvement  was  dynamic  branch  hinting.  By  using  a  table  of  branch  hint  addresses, 
we  could  correctly  hint  even  the  last  iteration  of  the  round  loop.  This  alone  gave  a  further  3%  speedup  (in 
the  one-block  version). 

At  this  point,  we  have  a  highly  optimized  version  of  AES  in  Counter  mode,  which  encrypts  four  blocks 
at  a  time,  called  CTR4.  Within  the  block  and  round  loops  (and  mostly  elsewhere):  every  odd-pipeline 
instruction  is  dual-issued  (there  are  more  even-pipeline  instructions);  there  are  no  dependency  stalls;  all 
branches  are  correctly  hinted  (except  the  final  iteration  of  the  block  loop). 

The  only  further  improvement  we  could  see  would  be  to  fully  unroll  the  round  loop.  This  would  not 
help  the  instruction  scheduling  any,  since  already  there  is  no  dependency  stall  and  no  more  possibilities  for 
dual  issue.  Also  the  branch  itself  is  dual  issued  and  properly  hinted  so  takes  no  time.  The  one  apparent 
improvement  comes  from  eliminating  the  single  (even-pipeline)  instruction  that  increments  the  round  counter 
itself.  (The  instructions  that  load  and  issue  the  branch  hints  for  the  round  loop  could  also  be  eliminated, 
but  since  these  are  odd-pipeline  instructions  dual  issued  with  essential  even-pipeline  commands,  eliminating 
them would save no time.) Since we process four blocks at a time, this only helps by 1/4 cycle/block/round.
The  downsides  would  be  requiring  three  different  versions  of  the  encryption  code,  one  for  each  key  length, 
and  each  of  these  unrolled  codes  would  be  much  longer  (by  roughly  4  to  6  times).  So  we  have  chosen  not  to 
unroll  the  round  loop. 

2.2  Other  Encryption  Modes 

Besides  Counter  mode,  we  also  developed  code  versions  for  other  modes  of  encryption,  primarily  for  direct 
comparison  with  IBM’s  results. 

Electronic  Codebook  (ECB)  mode  is  very  similar  to  Counter  mode,  except  the  AES  rounds  are  applied 
to  the  plaintext  block,  rather  than  to  a  counter.  This  saves  two  operations  per  block,  relative  to  Counter 
mode:  no  counter  block  is  incremented  nor  added  to  the  plaintext.  So  our  ECB  code  is  slightly  faster  than 
our  corresponding  CTR  code.  And  since  each  block  is  encrypted  independently,  we  can  partially  unroll  the 
block  loop  as  in  CTR  mode.  Hence  our  ECB  encryption  code  is  very  similar  to  our  CTR  code. 

We  did  not  develop  code  for  ECB  decryption,  nor  any  other  mode  requiring  the  AES  decryption  function, 
also  called  the  inverse  cipher.  The  inverse  cipher  is  more  complicated  due  to  the  larger  factors  in  the  inverse 
MixColumns  matrix.  (IBM’s  results  show  a  decrease  in  throughput  for  ECB  decryption.) 

Cipher  Block  Chaining  (CBC)  mode  begins  encryption  of  a  plaintext  block  by  adding  the  ciphertext  from 
the  previous  block  (except  the  first  block  uses  an  Initial  Value  instead  of  the  ciphertext  block).  This  feedback 
increases  security,  but  prevents  any  unrolling  of  the  block  loop.  Since  only  a  single  block  is  processed  at  a 
time, opportunities for instruction scheduling are greatly limited, compared to the non-feedback modes. So
the time per block is increased due to unavoidable data dependence waits and fewer dual issues; our resulting
CBC code is roughly half as fast as the CTR version. (CBC decryption can process blocks in parallel, using
the inverse AES cipher; we did not develop code for this.)

Table 2: Here we compare several different versions we have developed.

              throughput results (Gbit/sec)     loop model
code          keysize                           blks    clocks
              128        192        256                 round     last
our SIMD CTR results:
CTR0          0.496      0.411      0.351       1       85        -26
CTR1          0.867      0.731      0.631       1       44        31
CTR2          1.431      1.196      1.028       2       28        5.5
CTR4a         1.872      1.555      1.330       4       22.25     -4.25
CTR4          2.071      1.722      1.474       4       20        -2.75
our T-table CTR results:
Tab1          0.827      0.692      0.596       1       48        14
Tab2          1.084      0.914      0.790       1       35        27
our CBC results:
CBC1          0.898      0.752      0.647       1       44        15
CBC2          1.191      0.989      0.846       1       35        -7
our ECB results:
ECB1          1.058      0.884      0.759       1       38        6
ECB4a         1.976      1.639      1.400       4       21.25     -5.75
ECB4          2.092      1.737      1.484       4       20        -4.75

Besides  ECB,  CBC,  and  CTR  modes,  NIST  has  approved  two  other  modes  for  security [7].  Cipher 
Feedback  (CFB)  mode  and  Output  Feedback  (OFB)  mode  both  need  the  output  of  encrypting  the  previous 
block  before  they  can  begin  encrypting  the  next  block,  so  cannot  encrypt  blocks  in  parallel.  Both  also  add 
(xor)  the  result  of  an  AES  encryption  to  the  plaintext  block  to  get  the  ciphertext.  Hence,  for  decryption, 
both  use  only  the  forward  AES  algorithm,  not  the  inverse  cipher.  CFB  can  decrypt  blocks  in  parallel,  but 
not  OFB.  (We  did  not  develop  codes  for  these  modes,  though  they  would  be  relatively  simple  modifications 
to  versions  we  did  develop.) 

NIST has also approved three authentication modes based on block encryption: Cipher-based Message
Authentication Code (CMAC)[9] essentially uses CBC encryption to generate an authentication hash;
Counter with Cipher Block Chaining-Message Authentication Code (CCM)[10] combines CTR mode for
encryption with CBC mode for authentication; Galois/Counter Mode (GCM)[11] uses CTR mode for
encryption with a separate hash function not based on encryption. (Our main goal was to produce fast GCM
encryption/decryption, which is why our main interest in AES is the CTR mode.) None of these
authentication modes uses the inverse cipher.

2.3  T-table  Code 

Since  fast  software  implementations  of  AES  typically  use  the  “T-table”  approach  (where  table  look-ups  handle 
the  combined  SubBytes  and  MixColumns  steps),  we  wanted  to  try  this  on  the  CellBE.  So  we  developed  a 
T-table  code  to  investigate  how  the  algorithmic  parallelism  of  the  T-table  method  compares  with  the  SIMD 
parallelism  available  on  the  SPU. 

In  the  usual  software  implementation,  for  each  column  (4  bytes  =  1  word)  of  output,  each  of  the  four 
bytes  of  input  indexes  a  different  table  of  256  words,  and  those  four  words  are  added  (xor)  together.  This 
requires 4 tables x 256 entries x 4 Bytes = 4 KB of storage for tables. On the SPU, speed dictates that each
lookup  returns  a  quadword  (16  bytes  =  1  register  =  1  block),  since  otherwise  several  more  instructions  would 
be  required  to  get  the  desired  word  into  the  desired  position  in  a  register.  So  we  set  up  16  tables  (four  for 
each  column  of  output,  with  zeros  in  the  other  column  positions),  and  each  of  the  16  input  bytes  indexes 
one  of  those  tables,  with  the  16  output  quadwords  getting  summed  for  the  result.  Altogether  this  requires 
16 tables x 256 entries x 16 Bytes = 64 KB, or 1/4 of the total Local Store memory of an SPE.

The  lookups  are  done  for  each  byte  in  serial  fashion,  which  might  normally  suggest  a  loop  over  the  16 
bytes.  But  we  fully  unrolled  this  (potential)  byte  loop,  which  allows  us  to  replace  the  ShiftRows  step  by 
choosing  the  shifted  index  in  each  case.  For  each  of  the  16  table  lookups  in  a  round,  the  corresponding  byte 
first  must  be  moved  to  the  correct  position  in  the  “preferred  slot”  of  a  register,  with  all  higher  bits  of  that 
word  zeroed  out.  Different  approaches  to  do  this  were  combined  to  balance  the  two  pipelines. 

By  the  way,  the  exact  same  approach  is  used  in  the  Galois  Hash  operation  of  GCM.  There,  the  operation 
performs multiplication of a 128-bit data block with a known 128-bit constant H in the Galois field GF(2^128).
Sixteen  tables,  each  one  block  wide  by  256  long,  are  precomputed  from  H  to  give  the  contribution  to  the 
product  from  each  byte  of  the  data  block.  Then  this  Galois  multiplication  consists  of  using  each  input  byte 
to  index  a  different  table  and  adding  up  (xor)  all  16  of  the  128-bit  contributions  (by  the  distributive  property 
of  multiplication). 
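The table-driven Galois multiplication can be sketched in Python as below. This is a model under stated assumptions, not our SPU code: `gf128_mul` is a bit-by-bit multiply written in the bit ordering of the GCM specification, the names `build_hash_tables` and `table_mul` are hypothetical, and the value of H in the usage note is an arbitrary example constant.

```python
R = 0xE1 << 120  # reduction constant for GF(2^128) in GCM's bit ordering

def gf128_mul(x, y):
    """Bit-by-bit product of two 128-bit field elements (GCM convention)."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:      # bit i of x, counting from the left
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z

def build_hash_tables(h):
    """tables[i][b] = (block with byte i equal to b, all other bytes zero) * h."""
    return [[gf128_mul(b << (8 * (15 - i)), h) for b in range(256)]
            for i in range(16)]

def table_mul(x, tables):
    """Multiply x by h using one lookup per byte and xor of the 16 results."""
    acc = 0
    for i in range(16):
        byte = (x >> (8 * (15 - i))) & 0xFF
        acc ^= tables[i][byte]
    return acc
```

For example, with `tables = build_hash_tables(H)` for some 128-bit H, `table_mul(X, tables)` should agree with `gf128_mul(X, H)` for every block X, because the product is linear (over xor) in X.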

Our  T-table  implementation  (of  CTR  mode)  has  no  unrolling  of  the  block  loop  (nor  the  round  loop). 
The  round  loop  requires  35  clocks  per  round;  the  last  round  takes  longer.  (Since  the  last  round  lacks  the 
MixColumns  step,  the  T-table  method  requires  additional  instructions  to  mask  the  table  outputs.)  Although 
we  did  not  develop  a  multi-block  version  using  T-tables,  we  can  estimate  how  much  improvement  is  possible: 
it  appears  the  best  we  might  achieve  by  partially  unrolling  the  block  loop  would  be  over  27  clocks  per  round. 

One  other  improvement  for  the  T-table  approach  would  be  the  rather  obscure  trick  called  “counter-mode 
caching.”  For  15  out  of  16  blocks,  only  the  least  significant  byte  of  the  CTR  changes  from  the  previous  value. 
Then  for  the  first  round,  only  that  byte  needs  a  table  look-up;  the  rest  can  be  cached  from  the  last  block’s 
first  round.  (This  trick  doesn’t  help  the  SIMD  approach,  since  all  bytes  are  processed  in  parallel.)  We  have 
not  implemented  this,  but  estimate  that  counter-mode  caching  would  improve  the  throughput  rates  by  no 
more  than  6%  for  the  one-block  version.  (This  caching  trick  would  not  be  feasible  for  multi-block  versions. 
But  for  GCM,  only  the  four  least  significant  bytes  of  the  counter  ever  change,  so  the  results  of  the  first  round 
for  the  remaining  12  bytes  could  be  cached.) 

So  how  does  the  T-table  method  compare  to  the  SIMD  approach?  In  terms  of  memory,  T-tables  require 
an  extra  64  KB.  The  speed  comparison  depends  on  the  mode.  For  non-feedback  modes  of  encryption, 
such  as  CTR  mode,  our  4-block  SIMD  version  is  much  faster  than  the  T-table  approach  (about  45%  faster 
than our estimate for a multi-block table version). Hence “counter-mode caching” is moot. For feedback
modes  of  encryption,  such  as  CBC  mode,  our  1-block  SIMD  version  is  slightly  faster  (about  8%)  than  the 
T-table  approach.  (Both  approaches  take  35  clocks/round  in  the  round  loop;  the  difference  is  in  the  last 
round.  Conceivably  one  could  graft  T-table  rounds  to  a  SIMD  last  round  to  get  a  version  just  as  fast  as  our 
pure-SIMD  CBC  code.) 

But for AES decryption, the SIMD approach gets more complicated due to the larger factors in the
inverse MixColumns, while the T-table approach remains essentially unchanged, except for using a different
64 KB set of tables. We did not implement decryption, but judging from IBM’s results, SIMD decryption
for non-feedback modes should take about 27.5 clocks/round, comparable to the T-table approach. But
for decryption modes requiring feedback, we expect the T-table approach to be significantly faster than
SIMD. However, none of the five security or three authentication modes approved by NIST use the inverse
AES cipher with feedback in decrypting, so this potential advantage of T-tables might only apply to some
non-standard mode. Therefore, T-tables offer no significant speed advantages for any standard modes on
the SPU, yet carry the significant cost of using 1/4 of the Local Store memory (or 1/2 if both AES encryption
and decryption are needed).

3  Results  and  Conclusions 

We have successfully developed fast versions of AES for the Synergistic Processor Elements of the CellBE
processor. Our main interest was CTR mode, as part of Galois Counter Mode authenticated encryption,
but we also developed versions for ECB and CBC encryption modes. Table 3 compares our results with the
IBM benchmarks, for the two modes implemented by both teams. We measured the throughput rates for
our code using the system clock to find the time taken for our subroutine to encrypt a buffer full of blocks.

Table 3: We compare our measured throughput rates (for one SPU) with those published by IBM. Also
shown (using our models for IBM’s results): number of blocks processed in parallel, clocks per round per
block, and extra clocks per block for the last round.

                                throughput results (Gbit/sec)       loop model
                                keysize                                     clocks
    who                         128       192       256         blks    round    last

    ECB encr (no feedback):
      ours                      2.092     1.737     1.484       4       20       -4.75
      IBM’s                     2.059     1.710     1.462       4       20.25    -4

    CBC encr (feedback mode):
      ours                      1.191     0.989     0.846       1       35       -7
      IBM’s                     0.795     0.664     0.570       1       51        3
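The loop-model columns of Table 3 can be related to the measured throughput with a short calculation. The sketch below rests on two assumptions that are not stated in this section: the SPU clock rate is taken to be 3.2 GHz, and the model is read as clocks per block = (number of rounds) x (clocks per round) + (last-round extra). The names `model_gbits` and `TABLE3` are introduced here for illustration.

```python
CLOCK_HZ = 3.2e9                      # assumed SPU clock rate
BLOCK_BITS = 128
ROUNDS = {128: 10, 192: 12, 256: 14}  # AES rounds per key size

def model_gbits(key_bits, clocks_per_round, last_round_extra):
    """Throughput in Gbit/sec predicted by the loop model for one SPU."""
    clocks_per_block = ROUNDS[key_bits] * clocks_per_round + last_round_extra
    return BLOCK_BITS * CLOCK_HZ / clocks_per_block / 1e9

# (label, measured Gbit/sec for 128/192/256-bit keys, clocks per round, last-round extra)
TABLE3 = [
    ("ECB ours", (2.092, 1.737, 1.484), 20.0, -4.75),
    ("ECB IBM", (2.059, 1.710, 1.462), 20.25, -4.0),
    ("CBC ours", (1.191, 0.989, 0.846), 35.0, -7.0),
    ("CBC IBM", (0.795, 0.664, 0.570), 51.0, 3.0),
]

def max_relative_error():
    """Largest relative gap between the model and the tabulated throughput."""
    worst = 0.0
    for _, measured, per_round, last in TABLE3:
        for key_bits, gbits in zip((128, 192, 256), measured):
            worst = max(worst, abs(model_gbits(key_bits, per_round, last) - gbits) / gbits)
    return worst

if __name__ == "__main__":
    for label, measured, per_round, last in TABLE3:
        model = [round(model_gbits(k, per_round, last), 3) for k in (128, 192, 256)]
        print(label, "model:", model, "table:", list(measured))
```

Under these assumptions the model comes out slightly above each tabulated rate, which would be consistent with a small fixed overhead outside the round loop.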

Our  implementation  of  ECB  encryption  is  slightly  faster  than  IBM’s  (1.6%  for  128-bit  keys).  Compared 
to  our  loop  model  of  their  code,  we  were  able  to  save  one  more  instruction  per  four  blocks  in  the  round 
loop  (by  replacing  an  even  pipeline  instruction  by  four  odd  pipeline  instructions  as  mentioned  above).  More 
importantly,  we  are  willing  to  make  our  code  public,  which  IBM  is  not. 

And  for  CBC  encryption,  our  implementation  is  50%  faster  (for  128-bit  keys),  a  significant  improvement 
over  the  IBM  benchmark.  (We  remain  curious  why  there  is  such  a  difference  for  CBC  mode.) 

In developing our AES code, we compared the T-table approach (found in all the fastest standard C
implementations of AES), which uses serial table lookups, with the SIMD approach of processing a whole
block in parallel. For non-feedback encryption modes SIMD is much faster (approximately 45%). For
feedback modes of encryption and non-feedback decryption modes, T-tables are basically the same speed¹ as
SIMD but use up at least 1/4 of the Local Store memory. There are no standard modes where AES decryption
must be done using feedback, but if there were, T-tables would likely be faster than SIMD for those. So for
all standard modes, there is no reason to use T-tables on an SPU.

The method we used to develop fast code follows the suggestions in the IBM documentation for
programming the SPU[4]. While the IBM programming environment provides great support for writing in a
high level language such as C, including ways to include particular assembly language instructions, we chose
to develop the most time-intensive portions of GCM (including AES) directly in assembly language. The
first step was to arrange the algorithm to take full advantage of the SIMD architecture of the SPU,
including replacing data-dependent branching by selection operations. Then instructions were scheduled (moved
around), with the help of partial loop unrolling where feasible, to reduce the number of cycles where one or
both pipelines were idly waiting for a previous result. This included replacing instructions on one pipeline
by equivalent instructions on the other in order to balance the load, to get both pipelines done sooner. And
correctly hinting the remaining branches as often as possible eliminated instruction cache waits.

Our  independent  development  of  AES  on  the  CellBE  makes  fast  encryption  code  publicly  available,  and 
adds  more  confirmation  of  the  powerful  capabilities  of  the  CellBE  architecture. 


¹ Based partly on IBM’s results, assuming their decryption was SIMD.
A  Optimization  of  MixColumns

Here  we  detail  the  steps  by  which  we  optimized  the  MixColumns  step,  including  the  relevant  assembly 
language  source  code  (taken  out  of  context).  This  section  shows  most  of  the  interesting  optimizations  of  the 
round  loop,  since  our  implementation  of  SubBytes  basically  follows  the  SIMD  table  lookup  given  in  the  IBM 
Programming  Handbook[4]. 

Considering the 128-bit state block (register) as a 4 x 4 matrix of bytes, MixColumns performs
the same operation on each of the 4 columns (words in the register). For an input column (r0, r1, r2, r3),
the top output byte (#0) is given by 2 x r0 + 3 x r1 + r2 + r3, and the other output bytes are the rotated
equivalent (so output #1 = 2 x r1 + 3 x r2 + ..., etc.). The multiplication is in the Galois field of bytes, so
to multiply by 2 one shifts left 1 bit then reduces modulo the field polynomial, represented by the nine-bit
constant 0x11B. (If the most significant bit was initially 0, the result is the usual multiply by 2.) And as
usual, 3 x x = 2 x x + x, except each addition is bitwise xor.
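The arithmetic just described can be written out for a single column as a short Python sketch. The helper names `xtime` and `mix_column` are chosen here for illustration and do not come from our assembly source.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8): shift left, then reduce by 0x11B if needed."""
    b <<= 1
    if b & 0x100:          # the original msb was 1, so reduce
        b ^= 0x11B
    return b

def mix_column(col):
    """Apply MixColumns to one column (r0, r1, r2, r3)."""
    out = []
    for i in range(4):
        a, b, c, d = (col[(i + k) % 4] for k in range(4))
        # output #i = 2 x a + 3 x b + c + d, where 3 x b = 2 x b + b and + is xor
        out.append(xtime(a) ^ (xtime(b) ^ b) ^ c ^ d)
    return out
```

For instance, `mix_column([0xdb, 0x13, 0x53, 0x45])` gives `[0x8e, 0x4d, 0xa1, 0xbc]`, a commonly quoted MixColumns test column.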

The initial assembly version of this (in CTR0) was a direct SIMD implementation: clear msb of bytes
then shift quadword left by 1 bit (this could be done in one step if there were a “shift byte” instruction);
conditionally add 0x1B, using a byte selector based on the msb (bit 7), to get 2 x x; add original byte to get
3 x x; rotate columns and add rows. (Note: to aid readability, our assembly source uses named registers,
beginning $R; pipeline 0 instructions are flush left while pipeline 1 instructions are indented; dual-issued
instruction pairs are indicated by braces.)


# SIMD version #0 of Mix Columns
andbi   $Rtimes2, $Rstate, 0x7F                 # ain't no "shift byte"; clear msb
    shlqbii $Rtimes2, $Rtimes2, 1               # shift block 1 bit
xorbi   $Rtimes2m, $Rtimes2, 0x1B               # mod field polynomial
clgtbi  $Rbit7, $Rstate, 0x7F                   # if msb = 1
selb    $Rtimes2, $Rtimes2, $Rtimes2m, $Rbit7   # now have byte x 2 in GF
xor     $Rtimes3, $Rtimes2, $Rstate             # also byte x 3
roti    $Rrow1, $Rtimes3, 8                     # rotate columns and add:
xor     $Rcols, $Rtimes2, $Rrow1                # 2 x r0 + 3 x r1
roti    $Rrow2, $Rstate, 16
xor     $Rcols, $Rcols, $Rrow2                  # + 1 x r2
roti    $Rrow3, $Rstate, 24
xor     $Rstate, $Rcols, $Rrow3                 # + 1 x r3, and done

The next version (in CTR1) was essentially the same steps, but in a different order (instruction
scheduling), to get some dual issues and reduce data dependency stall:

# SIMD version #1 of Mix Columns
  andbi   $Rtimes2, $Rstate, 0x7F                   # no "shift byte"; clear msb
  clgtbi  $Rbit7, $Rstate, 0x7F                     # if msb = 1
{ roti    $Rrow2, $Rstate, 16
{     shlqbii $Rtimes2, $Rtimes2, 1                 # shift block 1 bit
{ roti    $Rrow3, $Rstate, 24
{     lqx     $Rroundkey, $Rroundkeys, $Rround      # get round key
  xorbi   $Rtimes2m, $Rtimes2, 0x1B                 # mod field polynomial
  selb    $Rtimes2, $Rtimes2, $Rtimes2m, $Rbit7     # now have byte x 2 in GF
  xor     $Rtimes3, $Rtimes2, $Rstate               # also byte x 3
  roti    $Rrow1, $Rtimes3, 8                       # rotate columns and add:
  xor     $Rcols, $Rtimes2, $Rrow2                  # 2 x r0 + 1 x r2
  xor     $Rcols, $Rcols, $Rrow3                    # + 1 x r3
  xor     $Rstate, $Rcols, $Rrow1                   # + 3 x r1, and done

Partially unrolling the block loop allowed reduction (CTR2) or elimination (CTR4a) of the remaining
data dependency stall, by interleaving instructions for 2 or 4 blocks to fill in the “wait” cycles. At this point,
we also reconsidered the overall approach to MixColumns. One change was adding rows 0 and 1 first, before
the multiply by 2: so 2 x r0 + 3 x r1 + r2 + r3 became 2 x (r0 + r1) + r1 + (r2 + r3); this eliminated one
xor and one roti. Another improvement came from integrating ShiftRows and AddRoundKey in as well, for
better instruction scheduling. The third change involved moving instructions from pipeline 0 (even), where
most of them were, to pipeline 1 (odd), to allow more dual issues: the remaining two roti instructions were
replaced by two shufb ones. Here some dual issues come from interleaving with other blocks, but we show
only those in one block.


# SIMD version #2 & #4a of Shift Rows and Mix Columns and Add Round Key
      shufb   $Rrow1, $Rstate, $Rstate, $Rshiftrow1     # move bytes: row 1
{ xor     $Rrows, $Rrow1, $Rroundkey                    # 1 + RK
{     shufb   $Rrow0, $Rstate, $Rstate, $Rshiftrows     # move bytes around: row 0
  xor     $Rrow01, $Rrow0, $Rrow1                       # (0+1)
{ clgtbi  $Rbit7, $Rrow01, 0x7F                         # mult 2*(0+1) in GF
{     shufb   $Rrow23, $Rrow01, $Rrow01, $Rrotrow2      # 2+3
{ xor     $Rrows, $Rrows, $Rrow23                       # 1+2+3 + RK
{     shlqbii $Rtimes2, $Rrow01, 1                      # shift 1
  andbi   $Rtimes2, $Rtimes2, 0xFE                      # clear lsb (was msb)
  xorbi   $Rtimes2m, $Rtimes2, 0x1B                     # mod field polynomial
  selb    $Rtimes2, $Rtimes2, $Rtimes2m, $Rbit7         # now have 2*(0+1) in GF
  xor     $Rstate, $Rrows, $Rtimes2                     # 2*(0+1) + (1+2+3) + RK
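The regrouping 2 x (r0 + r1) + r1 + (r2 + r3) can be checked against the direct form with a small block-wide model in Python. This is an illustrative sketch: `rot_rows` stands in for the word rotations (roti, or the equivalent shufb patterns), the helper names are introduced here, and ShiftRows and AddRoundKey are left out to isolate MixColumns.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo 0x11B."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def times2(block):
    return bytes(xtime(b) for b in block)

def rot_rows(block, n):
    """Within each 4-byte column, bring row r+n into row r (word rotate left by 8*n bits)."""
    return bytes(block[4 * c + (r + n) % 4] for c in range(4) for r in range(4))

def mix_direct(s):
    """2 x r0 + 3 x r1 + r2 + r3: three rotations and four xors (versions #0 and #1)."""
    t2 = times2(s)
    t3 = xor(t2, s)
    cols = xor(t2, rot_rows(t3, 1))
    cols = xor(cols, rot_rows(s, 2))
    return xor(cols, rot_rows(s, 3))

def mix_regrouped(s):
    """2 x (r0 + r1) + r1 + (r2 + r3): two rotations and three xors."""
    r1 = rot_rows(s, 1)
    r01 = xor(s, r1)              # (0+1)
    r23 = rot_rows(r01, 2)        # (2+3) is (0+1) moved by two rows
    return xor(times2(r01), xor(r1, r23))
```

Both functions should return the same block for any input, since 3 x r1 = 2 x r1 + r1 and the doubling distributes over xor.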


By this point (CTR4a), all the pipeline 1 instructions were dual-issued (within the loops), though there
were many pipeline 0 instructions left over. But judging by IBM’s times, there was still room for improvement,
by one more clock cycle per round per block. We couldn’t find any way to eliminate more instructions. So
the only option was to move more instructions from pipeline 0 to pipeline 1. Fortunately, we found ways to
do this, using some of the quirky pipeline 1 instructions. The shuffle bytes shufb instruction does special
things if the msb of the input byte is 1 (otherwise it picks a byte based on the 5 lowest bits); in particular,
repeated application could give the sequence 0xFF -> 0x80 -> 0x00. In this way, we replaced one selection
selb by two shufbs, though it required reversing the comparison (cgtbi): if the msb was 0, the comparison
gave 0xFF, but if the msb was 1 then 0x00; after two shufbs using a register full of the field polynomial
byte, the result byte was 0x00 or 0x1B respectively, the correct value to add for the Galois multiply.
This saves one cycle per round per block, by eliminating a pipeline 0 instruction, basically matching IBM’s
timing. In our final version (CTR4), this approach applies for 3 of the 4 blocks each round:


# SIMD version #4 (3 of 4 blocks) of Shift Rows, Mix Columns, Add Round Key
      shufb   $Rrow1, $Rstate, $Rstate, $Rshiftrow1     # move bytes: row 1
{ xor     $Rrows, $Rrow1, $Rroundkey                    # 1 + RK
{     shufb   $Rrow0, $Rstate, $Rstate, $Rshiftrows     # move bytes around: row 0
  xor     $Rrow01, $Rrow0, $Rrow1                       # (0+1)
{ cgtbi   $Rbit7, $Rrow01, -1                           # msb = 0 -> FF, msb = 1 -> 00
{     shufb   $Rrow23, $Rrow01, $Rrow01, $Rrotrow2      # 2+3
{ xor     $Rrows, $Rrows, $Rrow23                       # 1+2+3 + RK
{     shlqbii $Rtimes2, $Rrow01, 1                      # shift 1
# Note: in $Rmod each byte = 0x1B
{ andbi   $Rtimes2, $Rtimes2, 0xFE                      # clear lsb
{     shufb   $Rbit7, $Rmod, $Rmod, $Rbit7              # FF -> 80, 00 -> 1B
{ xor     $Rrows, $Rrows, $Rtimes2                      # 2*(0+1) + (1+2+3) + RK
{     shufb   $Rbit7, $Rmod, $Rmod, $Rbit7              # 80 -> 00, 1B -> 1B
  xor     $Rstate, $Rrows, $Rbit7                       # mod GF poly
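The shufb trick can be modelled in Python to see why two shuffles reproduce the conditional reduction. The sketch below is an emulation under assumptions: `shufb` and `cgtbi` are written from the instruction behaviour described in the text (control bytes of the form 10xxxxxx, 110xxxxx, 111xxxxx yield 0x00, 0xFF, 0x80; otherwise the 5 low bits index the 32 source bytes), and registers are represented as 16-byte `bytes` objects.

```python
def shufb(ra, rb, rc):
    """Model of shuffle bytes: each control byte of rc picks a byte or a special constant."""
    src = ra + rb                       # 32 candidate bytes
    out = bytearray(16)
    for i, c in enumerate(rc):
        if (c & 0xC0) == 0x80:
            out[i] = 0x00
        elif (c & 0xE0) == 0xC0:
            out[i] = 0xFF
        elif (c & 0xE0) == 0xE0:
            out[i] = 0x80
        else:
            out[i] = src[c & 0x1F]
    return bytes(out)

def cgtbi(ra, imm):
    """Model of signed per-byte compare: 0xFF where byte > imm, else 0x00."""
    return bytes(0xFF if (b - 256 if b & 0x80 else b) > imm else 0x00 for b in ra)

def times2_with_shufb(block):
    """Multiply every byte by 2 in GF(2^8) without a select instruction."""
    mod = bytes([0x1B] * 16)            # register full of the field polynomial byte
    sel = cgtbi(block, -1)              # msb = 0 -> FF, msb = 1 -> 00
    shifted = bytes((b << 1) & 0xFE for b in block)  # net effect of shlqbii 1 + andbi 0xFE
    sel = shufb(mod, mod, sel)          # FF -> 80, 00 -> 1B
    sel = shufb(mod, mod, sel)          # 80 -> 00, 1B -> 1B
    return bytes(s ^ m for s, m in zip(shifted, sel))
```

The key detail is that 0x1B itself is a valid "pick a byte" control value (index 27), so a register filled with 0x1B maps 0x1B back to 0x1B on the second pass.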

And for our final magic trick, we were able to move one more instruction from pipeline 0, but only for one of
the four blocks each round. The comparison instruction clgtbi, which generates a byte of all 0s or 1s based
on the msb, can be replaced using “gather bits from bytes” gbb (gets all 16 lsb’s) followed by “form select
mask for bytes” fsmb (repeats each of those 16 bits 8 times). Since this uses the lsb rather than the msb,
it must be done after the shift (which itself must become a quadword rotate instead), so requires another
quadword rotate back by a byte to put the mask back with its byte of origin. Also, since this does not reverse
the sense of the comparison (as needed for the previous trick), one additional shufb is required to get the
selection right. In short, one pipeline 0 instruction clgtbi of duration 2 cycles gets removed, and later four
pipeline 1 instructions, each of duration 4 cycles, get inserted. This is why it was only possible for one out
of four blocks: lots of other instructions were needed to fill in all that time; but with massive rescheduling
of instructions, it worked out. This trick saved one cycle per round for every 4 blocks (and beat IBM). So
for one block in CTR4, it looks like this (note that all pipeline 1 instructions get dual issued by interleaving
with other blocks; again only dual issues within the block are shown):


# SIMD version #4 (1 of 4 blocks) of Shift Rows, Mix Columns, Add Round Key
      shufb   $Rrow1, $Rstate, $Rstate, $Rshiftrow1     # move bytes: row 1
      shufb   $Rrow0, $Rstate, $Rstate, $Rshiftrows     # move bytes around: row 0
  xor     $Rrow01, $Rrow0, $Rrow1                       # (0+1)
      rotqbii $Rtimes2, $Rrow01, 1                      # mul by 2
      gbb     $Rbit7, $Rtimes2                          # get lsb (was msb)
      fsmb    $Rbit7, $Rbit7                            # byte selector
      rotqbyi $Rbit7, $Rbit7, -1                        # rot back to source byte
{ xor     $Rrows, $Rrow1, $Rroundkey                    # 1 + RK
{     shufb   $Rrow23, $Rrow01, $Rrow01, $Rrotrow2      # 2+3
# Note: in $Rmod each byte = 0x1B; in $Rzero each byte = 0x00
{ xor     $Rrows, $Rrows, $Rrow23                       # 1+2+3 + RK
{     shufb   $Rbit7, $Rmod, $Rmod, $Rbit7              # 00 -> 1B, FF -> 80
{ andbi   $Rtimes2, $Rtimes2, 0xFE                      # clear lsb
{     shufb   $Rbit7, $Rmod, $Rmod, $Rbit7              # 1B -> 1B, 80 -> 00
{ xor     $Rrows, $Rrows, $Rtimes2                      # 2*(0+1) + (1+2+3) + RK
{     shufb   $Rbit7, $Rmod, $Rzero, $Rbit7             # 1B -> 00, 00 -> 1B
  xor     $Rstate, $Rrows, $Rbit7                       # mod GF poly
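The gbb/fsmb variant can also be modelled in Python. As before this is an emulation under assumptions rather than SPU code: the helper functions are written from the behaviour described in the text, `rotqbyi` is assumed to take its rotate count from the low 4 bits of the immediate (so -1 rotates left by 15 bytes, i.e. right by one byte), and registers are 16-byte `bytes` objects.

```python
def shufb(ra, rb, rc):
    """Model of shuffle bytes, including the three special control patterns."""
    src = ra + rb
    out = bytearray(16)
    for i, c in enumerate(rc):
        if (c & 0xC0) == 0x80:
            out[i] = 0x00
        elif (c & 0xE0) == 0xC0:
            out[i] = 0xFF
        elif (c & 0xE0) == 0xE0:
            out[i] = 0x80
        else:
            out[i] = src[c & 0x1F]
    return bytes(out)

def rotqbii(q, n):
    """Rotate the whole quadword left by n bits (0 <= n <= 7)."""
    v = int.from_bytes(q, "big")
    v = ((v << n) | (v >> (128 - n))) & ((1 << 128) - 1)
    return v.to_bytes(16, "big")

def gbb(q):
    """Gather the lsb of each of the 16 bytes into a 16-bit mask (byte 0 first)."""
    m = 0
    for b in q:
        m = (m << 1) | (b & 1)
    return m

def fsmb(m):
    """Expand each of the 16 mask bits into a byte of 0x00 or 0xFF."""
    return bytes(0xFF if (m >> (15 - i)) & 1 else 0x00 for i in range(16))

def rotqbyi(q, n):
    """Rotate the quadword left by (n mod 16) bytes."""
    n &= 0xF
    return q[n:] + q[:n]

def times2_with_gbb(block):
    """Multiply every byte by 2 in GF(2^8) using only 'odd pipeline' style steps for the mask."""
    mod, zero = bytes([0x1B] * 16), bytes(16)
    rot = rotqbii(block, 1)             # each msb lands in the lsb of the byte to its left
    sel = rotqbyi(fsmb(gbb(rot)), -1)   # FF where the original msb was 1, back in place
    sel = shufb(mod, mod, sel)          # 00 -> 1B, FF -> 80
    sel = shufb(mod, mod, sel)          # 1B -> 1B, 80 -> 00
    sel = shufb(mod, zero, sel)         # 1B -> 00, 00 -> 1B
    shifted = bytes(b & 0xFE for b in rot)   # clear lsb (was a neighbour's msb)
    return bytes(s ^ m for s, m in zip(shifted, sel))
```

In this model the last shuffle uses index 27 to pick a byte from the zero register and index 0 to pick 0x1B, which restores the sense of the selection.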
B  Initial  AES  CTR  Assembly  Code

This version was our first attempt to use the SPU Assembly language to implement AES encryption: CTR0.
The  SIMD  instructions  process  all  parts  of  a  block  in  parallel.  The  SubBytes  table  lookup  is  based  on  that 
given  in  the  IBM  Programming  Handbook.  The  rest  is  implemented  in  a  direct  manner,  in  a  way  that  seems 
logical  from  a  programmer’s  point  of  view,  so  this  is  fairly  readable.  But  the  instructions  are  not  in  the  most 
efficient  order  from  the  machine’s  viewpoint:  there  is  a  lot  of  data  dependency  stall  and  no  dual  issues. 
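The structure of the register-based SubBytes lookup used in this listing (eight 5-bit partial lookups, then three levels of byte selection on bits 5, 6 and 7) can be modelled in Python as below. This is a sketch for illustration: `shufb` is reduced to its plain byte-picking case, which is all that is needed once the index is masked to 5 bits, and `selb`, `bit_mask` and `sub_bytes` are names introduced here.

```python
SBOX_ROWS = [
    "637C777BF26B6FC53001672BFED7AB76", "CA82C97DFA5947F0ADD4A2AF9CA472C0",
    "B7FD9326363FF7CC34A5E5F171D83115", "04C723C31896059A071280E2EB27B275",
    "09832C1A1B6E5AA0523BD6B329E32F84", "53D100ED20FCB15B6ACBBE394A4C58CF",
    "D0EFAAFB434D338545F9027F503C9FA8", "51A3408F929D38F5BCB6DA2110FFF3D2",
    "CD0C13EC5F974417C4A77E3D645D1973", "60814FDC222A908846EEB814DE5E0BDB",
    "E0323A0A4906245CC2D3AC629195E479", "E7C8376D8DD54EA96C56F4EA657AAE08",
    "BA78252E1CA6B4C6E8DD741F4BBD8B8A", "703EB5664803F60E613557B986C11D9E",
    "E1F8981169D98E949B1E87E9CE5528DF", "8CA1890DBFE6426841992D0FB054BB16",
]
SBOX_REGS = [bytes.fromhex(row) for row in SBOX_ROWS]   # 16 registers of 16 bytes
SBOX = b"".join(SBOX_REGS)

def shufb(ra, rb, rc):
    """Plain case of shuffle bytes: each 5-bit control value picks one of 32 bytes."""
    src = ra + rb
    return bytes(src[c & 0x1F] for c in rc)

def selb(ra, rb, rc):
    """Bitwise select: take bits from rb where rc is 1, from ra where rc is 0."""
    return bytes((a & ~c & 0xFF) | (b & c) for a, b, c in zip(ra, rb, rc))

def bit_mask(state, bit):
    """0xFF for bytes of state that have the given bit set, else 0x00."""
    return bytes(0xFF if b & bit else 0x00 for b in state)

def sub_bytes(state):
    """Apply the S-box to all 16 bytes at once, register style."""
    idx = bytes(b & 0x1F for b in state)                 # lower 5 bits
    pairs = [shufb(SBOX_REGS[2 * k], SBOX_REGS[2 * k + 1], idx) for k in range(8)]
    bit5 = bit_mask(state, 0x20)
    quads = [selb(pairs[2 * k], pairs[2 * k + 1], bit5) for k in range(4)]
    bit6 = bit_mask(state, 0x40)
    halves = [selb(quads[0], quads[1], bit6), selb(quads[2], quads[3], bit6)]
    return selb(halves[0], halves[1], bit_mask(state, 0x80))
```

Each level halves the number of candidate registers, so 8 shuffles and 7 selects replace 16 independent table lookups.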

The  format  is  as  in  the  optimization  examples  above:  named  registers  begin  $R  and  statement  labels 
begin  L;  pipeline  0  instructions  are  flush  left  while  pipeline  1  instructions  are  indented. 

## AES function, CTR mode, basic version (0)  2008 Mar 24 Mon 20:42:10
## 5 input parameters: (NO error checking)
##   pointer to data buffer
##   pointer to Round Key buffer
##   number of data blocks (must be compatible with length of data buffer)
##   number of rounds (must be compatible with length of Round Key buffer)
##   counter value for first data block
## 1 output parameter:
##   counter value for next data block

    .file       "aes_ctr.s"
    .section    mydata,"a",@progbits
    .align      4
Sbox:
    .octa       0x637C777BF26B6FC53001672BFED7AB76
    .octa       0xCA82C97DFA5947F0ADD4A2AF9CA472C0
    .octa       0xB7FD9326363FF7CC34A5E5F171D83115
    .octa       0x04C723C31896059A071280E2EB27B275
    .octa       0x09832C1A1B6E5AA0523BD6B329E32F84
    .octa       0x53D100ED20FCB15B6ACBBE394A4C58CF
    .octa       0xD0EFAAFB434D338545F9027F503C9FA8
    .octa       0x51A3408F929D38F5BCB6DA2110FFF3D2
    .octa       0xCD0C13EC5F974417C4A77E3D645D1973
    .octa       0x60814FDC222A908846EEB814DE5E0BDB
    .octa       0xE0323A0A4906245CC2D3AC629195E479
    .octa       0xE7C8376D8DD54EA96C56F4EA657AAE08
    .octa       0xBA78252E1CA6B4C6E8DD741F4BBD8B8A
    .octa       0x703EB5664803F60E613557B986C11D9E
    .octa       0xE1F8981169D98E949B1E87E9CE5528DF
    .octa       0x8CA1890DBFE6426841992D0FB054BB16
ShiftRows:
    .octa       0x00050A0F04090E03080D02070C01060B
Incr:
    .octa       0x00000000000000000000000000000001

    .text
    .align      3
    .global     aes_ctr
    .type       aes_ctr, @function

##REGISTER DEFINITIONS##
    .set    Rin_dat, 3              # 1st param = ptr to block
    .set    Rin_key, 4              # 2nd param = ptr to keys
    .set    Rin_nb, 5               # 3rd param = number of blocks
    .set    Rin_nr, 6               # 4th param = number of rounds
    .set    Rin_ctr, 7              # 5th param = counter initial
    .set    Rout_ctr, 3             # output param = counter next value

    .set    RTOP, 79                # last volatile reg
    .set    Rnrounds, RTOP - 20     # # of Rounds
    .set    Rincr, RTOP - 19        # increment for CTR
    .set    Rdat, RTOP - 18         # 1st param = ptr to block
    .set    Rroundkeys, RTOP - 17   # Keys Ptr (const)
    .set    Rshiftrows, RTOP - 16   # ShiftRows (const)
    .set    Rsbox0, RTOP - 15       # S-box Table (const)
    .set    Rsbox1, RTOP - 14       # S-box Table (const)
    .set    Rsbox2, RTOP - 13       # S-box Table (const)
    .set    Rsbox3, RTOP - 12       # S-box Table (const)
    .set    Rsbox4, RTOP - 11       # S-box Table (const)
    .set    Rsbox5, RTOP - 10       # S-box Table (const)
    .set    Rsbox6, RTOP - 9        # S-box Table (const)
    .set    Rsbox7, RTOP - 8        # S-box Table (const)
    .set    Rsbox8, RTOP - 7        # S-box Table (const)
    .set    Rsbox9, RTOP - 6        # S-box Table (const)
    .set    RsboxA, RTOP - 5        # S-box Table (const)
    .set    RsboxB, RTOP - 4        # S-box Table (const)
    .set    RsboxC, RTOP - 3        # S-box Table (const)
    .set    RsboxD, RTOP - 2        # S-box Table (const)
    .set    RsboxE, RTOP - 1        # S-box Table (const)
    .set    RsboxF, RTOP - 0        # S-box Table (const)

    .set    Rround, 2               # Round counter
    .set    Rctr, 3                 # CTR (3 = reg for return)
    .set    Rsbox01, 4              #
    .set    Rsbox23, 5              #
    .set    Rsbox45, 6              #
    .set    Rsbox67, 7              #
    .set    Rsbox89, 8              #
    .set    RsboxAB, 9              #
    .set    RsboxCD, 10             #
    .set    RsboxEF, 11             #
    .set    Rstate, 12              # block State
    .set    Ridx, 13                #
    .set    Rblock, 14              # block counter
    .set    Rbit5, 15               #
    .set    Rbit6, 16               #
    .set    Rbit7, 17               #
    .set    NR, 15                  # number of reg per block (unused)

    .set    Rsbox03, Rsbox01        #
    .set    Rsbox47, Rsbox23        #
    .set    Rsbox8B, Rsbox45        #
    .set    RsboxCF, Rsbox67        #
    .set    Rsbox07, Rsbox03        #
    .set    Rsbox8F, Rsbox47        #
    .set    Rtimes2, Rsbox23        #
    .set    Rtimes2m, Rsbox45       #
    .set    Rtimes3, Rsbox67        #
    .set    Rcols, Rsbox89          #
    .set    Rrow1, RsboxAB          #
    .set    Rrow2, RsboxCD          #
    .set    Rrow3, RsboxEF          #
    .set    Rroundkey, Rbit5        #
    .set    Rdatablk, Rbit6         #

aes_ctr:
# load tables into registers
    lqr     $Rincr, Incr
    lqr     $Rshiftrows, ShiftRows
    lqr     $Rsbox0, Sbox+0x00
    lqr     $Rsbox1, Sbox+0x10
    lqr     $Rsbox2, Sbox+0x20
    lqr     $Rsbox3, Sbox+0x30
    lqr     $Rsbox4, Sbox+0x40
    lqr     $Rsbox5, Sbox+0x50
    lqr     $Rsbox6, Sbox+0x60
    lqr     $Rsbox7, Sbox+0x70
    lqr     $Rsbox8, Sbox+0x80
    lqr     $Rsbox9, Sbox+0x90
    lqr     $RsboxA, Sbox+0xA0
    lqr     $RsboxB, Sbox+0xB0
    lqr     $RsboxC, Sbox+0xC0
    lqr     $RsboxD, Sbox+0xD0
    lqr     $RsboxE, Sbox+0xE0
    lqr     $RsboxF, Sbox+0xF0
# setup so round reg counts up to zero from neg.
# then adjust pointer to roundkeys so sum points to round key
shli    $Rnrounds, $Rin_nr, 4                   # #rounds*16
sfi     $Rnrounds, $Rnrounds, 0x10              # neg. of (#rounds-1)*16 to addr QW
sf      $Rroundkeys, $Rnrounds, $Rin_key        # offset: roundkeys+round -> round key
# use similar count-up with block counter
shli    $Rblock, $Rin_nb, 4                     # #blocks*16
sfi     $Rblock, $Rblock, 0                     # neg. of (#blocks)*16 to addr QW
sf      $Rdat, $Rblock, $Rin_dat                # offset: dataptr+block -> data
ori     $Rctr, $Rin_ctr, 0                      # move initial value to CTR
Lblockloop:
ori     $Rstate, $Rctr, 0                       # move CTR to State
a       $Rctr, $Rctr, $Rincr                    # increment CTR
ori     $Rround, $Rnrounds, 0                   # initialize round counter
# ROUND 0:
# SIMD version of Add Round Key
    lqx     $Rroundkey, $Rroundkeys, $Rround    # get round key
xor     $Rstate, $Rstate, $Rroundkey            # add it to state
Lroundloop:
ai      $Rround, $Rround, 0x10                  # next round (*16)
# SIMD version of S-box
# presumes S-box table pre-loaded into sbox0 - sboxF
andbi   $Ridx, $Rstate, 0x1F                    # lower 5 bits for partial lookup
    shufb   $Rsbox01, $Rsbox0, $Rsbox1, $Ridx   # partial lookup if 3 msb = 000
    shufb   $Rsbox23, $Rsbox2, $Rsbox3, $Ridx   # partial lookup if 3 msb = 001
    shufb   $Rsbox45, $Rsbox4, $Rsbox5, $Ridx   # partial lookup if 3 msb = 010
    shufb   $Rsbox67, $Rsbox6, $Rsbox7, $Ridx   # partial lookup if 3 msb = 011
    shufb   $Rsbox89, $Rsbox8, $Rsbox9, $Ridx   # partial lookup if 3 msb = 100
    shufb   $RsboxAB, $RsboxA, $RsboxB, $Ridx   # partial lookup if 3 msb = 101
    shufb   $RsboxCD, $RsboxC, $RsboxD, $Ridx   # partial lookup if 3 msb = 110
    shufb   $RsboxEF, $RsboxE, $RsboxF, $Ridx   # partial lookup if 3 msb = 111
andbi   $Rbit5, $Rstate, 0x20                   # get next bit (#5)
ceqbi   $Rbit5, $Rbit5, 0x20                    # form bytewise selector
selb    $Rsbox03, $Rsbox01, $Rsbox23, $Rbit5    # partial lookup if 2 msb = 00
selb    $Rsbox47, $Rsbox45, $Rsbox67, $Rbit5    # partial lookup if 2 msb = 01
selb    $Rsbox8B, $Rsbox89, $RsboxAB, $Rbit5    # partial lookup if 2 msb = 10
selb    $RsboxCF, $RsboxCD, $RsboxEF, $Rbit5    # partial lookup if 2 msb = 11
andbi   $Rbit6, $Rstate, 0x40                   # get next bit (#6)
ceqbi   $Rbit6, $Rbit6, 0x40                    # form bytewise selector
selb    $Rsbox07, $Rsbox03, $Rsbox47, $Rbit6    # partial lookup if 1 msb = 0
selb    $Rsbox8F, $Rsbox8B, $RsboxCF, $Rbit6    # partial lookup if 1 msb = 1
clgtbi  $Rbit7, $Rstate, 0x7F                   # form selector based on msb (#7)
selb    $Rstate, $Rsbox07, $Rsbox8F, $Rbit7     # finish table lookup
# SIMD version of shift rows
# presumes shiftrows reg pre-loaded to:
# 0x 00 05 0A 0F 04 09 0E 03 08 0D 02 07 0C 01 06 0B
    shufb   $Rstate, $Rstate, $Rstate, $Rshiftrows  # move bytes around
# SIMD version of Mix Columns
andbi   $Rtimes2, $Rstate, 0x7F                 # ain't no "shift byte"; clear msb
    shlqbii $Rtimes2, $Rtimes2, 1               # shift block 1 bit
xorbi   $Rtimes2m, $Rtimes2, 0x1B               # mod field polynomial
clgtbi  $Rbit7, $Rstate, 0x7F                   # if msb = 1
selb    $Rtimes2, $Rtimes2, $Rtimes2m, $Rbit7   # now have byte x 2 in GF
xor     $Rtimes3, $Rtimes2, $Rstate             # also byte x 3
roti    $Rrow1, $Rtimes3, 8                     # rotate columns and add:
xor     $Rcols, $Rtimes2, $Rrow1                # 2 x r0 + 3 x r1
roti    $Rrow2, $Rstate, 16
xor     $Rcols, $Rcols, $Rrow2                  # + 1 x r2
roti    $Rrow3, $Rstate, 24
xor     $Rstate, $Rcols, $Rrow3                 # + 1 x r3, and done
# SIMD version of Add Round Key
# assumes round reg has (round number - # rounds) x 16, keyaddr reg points to last key
# if fully unroll round loop, could also pre-load round keys into registers
    lqx     $Rroundkey, $Rroundkeys, $Rround    # get round key
xor     $Rstate, $Rstate, $Rroundkey            # add it to state
    brnz    $Rround, Lroundloop                 # branch if not last round
ai      $Rround, $Rround, 0x10                  # next round (*16)
# LAST ROUND


#  SIMD  version  of  S-box 

#  presumes  S-box  table  pre-loaded  into  sboxl  -  sboxF 


andbi 

$Ridx,  $Rstate,  OxlF 

# 

lower  5 

bits  for  partial 

.  lookup 

shufb 

$Rsbox01 , 

$Rsbox0,  $Rsboxl, 

$Ridx 

# 

partial 

lookup 

if 

3 

msb 

=  000 

shufb 

$Rsbox23 , 

$Rsbox2 ,  $Rsbox3 , 

$Ridx 

# 

partial 

lookup 

if 

3 

msb 

=  001 

shufb 

$Rsbox45 , 

$Rsbox4,  $Rsbox5, 

$Ridx 

# 

partial 

lookup 

if 

3 

msb 

=  010 

shufb 

$Rsbox67 , 

$Rsbox6 ,  $Rsbox7 , 

$Ridx 

# 

partial 

lookup 

if 

3 

msb 

=  Oil 

shufb 

$Rsbox89 , 

$Rsbox8 ,  $Rsbox9 , 

$Ridx 

# 

partial 

lookup 

if 

3 

msb 

=  100 

shufb 

$RsboxAB, 

$RsboxA,  $RsboxB, 

$Ridx 

# 

partial 

lookup 

if 

3 

msb 

=  101 

shufb 

$RsboxCD, 

$RsboxC,  $RsboxD, 

$Ridx 

# 

partial 

lookup 

if 

3 

msb 

=  110 

shufb 

$RsboxEF, 

$RsboxE,  $RsboxF, 

$Ridx 

# 

partial 

lookup 

if 

3 

msb 

=  111 

andbi 

$Rbit5,  $Rstate,  0x20 

# 

get  next  bit  (#5) 

ceqbi 

$Rbit5,  $Rbit5,  0x20 

# 

form  bytewise  selector 

selb 

$Rsbox03 , 

$Rsbox01,  $Rsbox23,  $Rbit5 

# 

partial 

lookup 

if 

2 

msb 

=  00 

selb 

$Rsbox47 , 

$Rsbox45,  $Rsbox67,  $Rbit5 

# 

partial 

lookup 

if 

2 

msb 

=  01 

selb 

$Rsbox8B, 

$Rsbox89,  $RsboxAB,  $Rbit5 

# 

partial 

lookup 

if 

2 

msb 

=  10 

selb 

$RsboxCF, 

$RsboxCD,  $RsboxEF,  $Rbit5 

# 

partial 

lookup 

if 

2 

msb 

=  11 

andbi   $Rbit6, $Rstate, 0x40                   # get next bit (#6)
ceqbi   $Rbit6, $Rbit6, 0x40                    # form bytewise selector
selb    $Rsbox07, $Rsbox03, $Rsbox47, $Rbit6    # partial lookup if 1 msb = 0
selb    $Rsbox8F, $Rsbox8B, $RsboxCF, $Rbit6    # partial lookup if 1 msb = 1
clgtbi  $Rbit7, $Rstate, 0x7F                   # form selector based on msb (#7)
selb    $Rstate, $Rsbox07, $Rsbox8F, $Rbit7     # finish table lookup

# SIMD version of shift rows
# presumes shiftrows reg pre-loaded to:
# 0x 00 05 0A 0F 04 09 0E 03 08 0D 02 07 0C 01 06 0B
shufb   $Rstate, $Rstate, $Rstate, $Rshiftrows  # move bytes around

# SIMD version of Add Round Key
# assumes round reg has (round number - # rounds) x 16, keyaddr reg points to last key
# if fully unroll round loop, could also pre-load round keys into registers
lqx     $Rroundkey, $Rroundkeys, $Rround        # get round key
xor     $Rstate, $Rstate, $Rroundkey            # add it to state

# use similar count-up with block counter
lqx     $Rdatablk, $Rdat, $Rblock               # get next block of data
xor     $Rdatablk, $Rstate, $Rdatablk           # add it to encrypted CTR
stqx    $Rdatablk, $Rdat, $Rblock               # overwrite block of data
ai      $Rblock, $Rblock, 0x10                  # next block
brnz    $Rblock, Lblockloop                     # branch if not last block
bi      $lr                                     # return
.ident  "DRC"
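
The table lookup in the listing above can be modeled in a few lines of Python. This is only a sketch of the idea, not a translation of the assembly: the function name `sbox_split_lookup` is hypothetical, each `shufb` is modeled as an index into a 32-byte slice of the table, and each `selb` as a conditional choice. It checks that eight partial lookups on the low five bits, followed by selections on bits 5, 6 and 7, give the same result as a direct 256-entry lookup.

```python
# AES S-box, 16 rows of 16 bytes, in the same order as the Sbox table in the listings.
SBOX = bytes.fromhex(
    "637C777BF26B6FC53001672BFED7AB76"
    "CA82C97DFA5947F0ADD4A2AF9CA472C0"
    "B7FD9326363FF7CC34A5E5F171D83115"
    "04C723C31896059A071280E2EB27B275"
    "09832C1A1B6E5AA0523BD6B329E32F84"
    "53D100ED20FCB15B6ACBBE394A4C58CF"
    "D0EFAAFB434D338545F9027F503C9FA8"
    "51A3408F929D38F5BCB6DA2110FFF3D2"
    "CD0C13EC5F974417C4A77E3D645D1973"
    "60814FDC222A908846EEB814DE5E0BDB"
    "E0323A0A4906245CC2D3AC629195E479"
    "E7C8376D8DD54EA96C56F4EA657AAE08"
    "BA78252E1CA6B4C6E8DD741F4BBD8B8A"
    "703EB5664803F60E613557B986C11D9E"
    "E1F8981169D98E949B1E87E9CE5528DF"
    "8CA1890DBFE6426841992D0FB054BB16"
)


def sbox_split_lookup(x: int) -> int:
    """Look up SBOX[x] with 8 partial lookups followed by 3 levels of selection."""
    idx = x & 0x1F  # low 5 bits index one pair of 16-byte registers
    # One partial lookup per register pair (models the eight shufb instructions):
    # partial[k] is the right answer if the 3 most significant bits of x equal k.
    partial = [SBOX[32 * k + idx] for k in range(8)]
    # Select on bit 5 (models the first four selb instructions): 8 candidates -> 4.
    bit5 = 1 if x & 0x20 else 0
    s03, s47, s8b, scf = (partial[2 * j + bit5] for j in range(4))
    # Select on bit 6: 4 candidates -> 2.
    bit6 = bool(x & 0x40)
    s07 = s47 if bit6 else s03
    s8f = scf if bit6 else s8b
    # Select on bit 7 (the listing tests "x > 0x7F"): 2 candidates -> 1.
    return s8f if x > 0x7F else s07
```

In the assembly every step acts on all 16 bytes of the state at once, which is why the selectors are formed bytewise; the model above handles a single byte.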
C  Final  AES  CTR  Assembly  Code 

Here  is  the  final  version  of  the  CTR4  code.  This  has  been  painstakingly  optimized.  (As  a  result,  it  is  pretty 
much  unreadable.)  Within  the  block  and  round  loops:  every  odd-pipeline  instruction  is  dual-issued;  there 
are  no  data  dependency  stalls;  all  branches  are  correctly  hinted  (except  the  final  iteration  of  the  block  loop). 
The  same  is  true  in  the  setup  (before  the  block  loop),  except  the  hint  table  loop  has  some  data  dependency 
stalls  and  its  last  iteration  branch  is  unhinted. 

The  format  is  as  in  the  optimization  examples  above:  named  registers  begin  $R  and  statement  labels  begin 
L;  pipeline  0  instructions  are  flush  left  while  pipeline  1  instructions  are  indented;  dual-issued  instruction  pairs 
are  indicated  by  braces. 
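
The block-counter bookkeeping that this listing sets up (a negative byte count that counts up to zero, plus a record of how many blocks of the last 4-block set are real) can be sketched as follows. This is a minimal model under stated assumptions: the function name `ctr4_bookkeeping` is hypothetical, the counter is treated as an unbounded Python integer rather than a 32-bit word, and only the arithmetic is modeled, not the encryption itself.

```python
def ctr4_bookkeeping(nbytes: int, ctr0: int):
    """Return (loop iterations, next counter value) for a buffer of nbytes bytes."""
    # Round the byte count up to whole blocks and keep (number of blocks) % 4.
    blkpad = ((nbytes + 15) & 48) >> 4
    # Minus the byte count, rounded up (in magnitude) to whole 4-block sets.
    block = -nbytes & -64
    ctr = ctr0
    iterations = 0
    while block != 0:      # the block counter counts up to zero
        ctr += 4           # each pass uses four consecutive counter values
        block += 0x40      # and covers 64 bytes of data
        iterations += 1
    if blkpad != 0:        # the last set was only partly used:
        ctr -= 4           # back up one pass, then add the blocks actually used
    return iterations, ctr + blkpad
```

For example, a 65-byte buffer holds five blocks, so the loop runs twice and the counter returned for the next call is the initial value plus five.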

##  Revised  AES  function,  CTR  mode,  4-block  version 

##  2009  Jan  8  Thu  14:25:44  modified  to  take  #  bytes,  not  blocks 

##  5  input  parameters:  (NO  error  checking) 

##  pointer  to  data  buffer 

##  pointer  to  Round  Key  buffer 

##  number  of  data  BYTES  (was  BLOCKS) 

##  number  of  rounds 

##  counter  value  for  first  data  block 
##  1  output  parameter: 

##  counter  value  for  next  data  block 
.file   "aes_ctr.s"
.section mydata, "a", @progbits
.align  4
Sbox:
.octa   0x637C777BF26B6FC53001672BFED7AB76
.octa   0xCA82C97DFA5947F0ADD4A2AF9CA472C0
.octa   0xB7FD9326363FF7CC34A5E5F171D83115
.octa   0x04C723C31896059A071280E2EB27B275
.octa   0x09832C1A1B6E5AA0523BD6B329E32F84
.octa   0x53D100ED20FCB15B6ACBBE394A4C58CF
.octa   0xD0EFAAFB434D338545F9027F503C9FA8
.octa   0x51A3408F929D38F5BCB6DA2110FFF3D2
.octa   0xCD0C13EC5F974417C4A77E3D645D1973
.octa   0x60814FDC222A908846EEB814DE5E0BDB
.octa   0xE0323A0A4906245CC2D3AC629195E479
.octa   0xE7C8376D8DD54EA96C56F4EA657AAE08
.octa   0xBA78252E1CA6B4C6E8DD741F4BBD8B8A
.octa   0x703EB5664803F60E613557B986C11D9E
.octa   0xE1F8981169D98E949B1E87E9CE5528DF
.octa   0x8CA1890DBFE6426841992D0FB054BB16
ShiftRows:
.octa   0x00050A0F04090E03080D02070C01060B      # standard (row 0)
.octa   0x050A0F00090E03040D02070801060B0C      # row 1 on top
RotRow2:
.octa   0x02030001060704050A0B08090E0F0C0D      # rotate row 2 to top
# Note: to rotate word by bytes using shufb:
#   000102030405060708090A0B0C0D0E0F
#   0102030005060704090A0B080D0E0F0C
#   02030001060704050A0B08090E0F0C0D
#   03000102070405060B08090A0F0C0D0E
SaveReg:
.fill   4*4, 4, 0                               # to save registers
                                                # (size cannot exceed 8)
BranchHints:
.fill   16*4, 4, 0                              # for dynamic br. hints

.text
.global aes_ctr
.type   aes_ctr, @function


##REGISTER DEFINITIONS##
# in/out params
.set    Rin_dat, 3              # 1st param = ptr to block
.set    Rin_key, 4              # 2nd param = ptr to keys
.set    Rin_nb, 5               # 3rd param = number of bytes
.set    Rin_nr, 6               # 4th param = number of rounds
.set    Rin_ctr, 7              # 5th param = counter initial value
.set    Rout_ctr, 3             # output param = counter next value
# per block values
.set    Rsbox01, 2
.set    Rsbox23, 13
.set    Rsbox45, 4
.set    Rsbox67, 5
.set    Rsbox89, 6
.set    RsboxAB, 7
.set    RsboxCD, 8
.set    RsboxEF, 9
.set    Rbit5, 10
.set    Rbit6, 11
.set    Rbit7, 12
.set    Rctr, 3                 # CTR = output (=1st input)
.set    Rstate, Rbit7           # block State
.set    Ridx, RsboxEF
.set    Rsbox03, Rsbox01
.set    Rsbox47, Rsbox45
.set    Rsbox8B, Rsbox89
.set    RsboxCF, RsboxCD
.set    Rsbox07, Rsbox01
.set    Rsbox8F, Rsbox89
.set    Rrow0, 2
.set    Rrow1, 13
.set    Rrow01, 4
.set    Rrow23, 5
.set    Rrows, 6
.set    Rtimes2, 7
.set    Rzero, 8                # temporary zero reg
.set    Rdat, Rsbox23           # ptr to data for block
.set    Rdatablk, Rbit5
.set    NR, 12                  # number of reg per block
# independent of block:
.set    Rblockout, Rsbox01      # temporary copy of block counter
.set    Rhint, 51               # branch hint
.set    Rhints, 52              # branch hint table
.set    Rroundkey0, 53
.set    Rblock, 54              # block counter (0th block of set)
.set    Rround, 55              # Round counter
.set    Rroundkey, 56
# (size cannot exceed 8)
# constant values:
.set    Rmod, 50                # for mod GF poly
.set    Rblkpad, 57             # block pad
.set    Rnrounds, 58            # # of Rounds
.set    Rroundkeys, 59          # Keys Ptr (const)
.set    Rincr, 60               # increment for CTR
.set    Rshiftrows, 61          # ShiftRows (const)
.set    Rshiftrow1, 62
.set    Rrotrow2, 63
.set    Rsbox0, 64              # S-box Table (const)
.set    Rsbox1, 65              # S-box Table (const)
.set    Rsbox2, 66              # S-box Table (const)
.set    Rsbox3, 67              # S-box Table (const)
.set    Rsbox4, 68              # S-box Table (const)
.set    Rsbox5, 69              # S-box Table (const)
.set    Rsbox6, 70              # S-box Table (const)
.set    Rsbox7, 71              # S-box Table (const)
.set    Rsbox8, 72              # S-box Table (const)
.set    Rsbox9, 73              # S-box Table (const)
.set    RsboxA, 74              # S-box Table (const)
.set    RsboxB, 75              # S-box Table (const)
.set    RsboxC, 76              # S-box Table (const)
.set    RsboxD, 77              # S-box Table (const)
.set    RsboxE, 78              # S-box Table (const)
.set    RsboxF, 79              # S-box Table (const)

.align  3
aes_ctr:


# setup so round reg counts up to zero from neg.
# then adjust pointer to roundkeys so sum points to round key
# use similar count-up with block counter
# for 4 blocks at once, keep track of padding at end
# load tables into registers
{ shli      $Rnrounds, $Rin_nr, 4  # *16 to address quadwords
{     hbrr      Lhinttabloop_end, Lhinttabloop  # hint for hint loop
{ il        $Rincr, 1
{     lqr       $Rsbox0, Sbox+0x00
{ ai        $Rblkpad, $Rin_nb, 15  # round up to whole blocks
{     lqr       $Rsbox1, Sbox+0x10
{ sfi       $Rblock, $Rin_nb, 0  # -(#bytes)
{     rotqmbyi  $Rincr, $Rincr, -12  # move to rightmost word
{ sfi       $Rnrounds, $Rnrounds, 0x10  # neg. of (#rounds-1)*16
{     lqr       $Rsbox2, Sbox+0x20
{ andi      $Rblkpad, $Rblkpad, 48  # [ (# blocks) % 4 ] * 16
{     lqr       $Rsbox3, Sbox+0x30
{ sf        $Rroundkeys, $Rnrounds, $Rin_key  # roundkeys+round -> round key
{     lqr       $Rsbox4, Sbox+0x40
{ andi      $Rblock, $Rblock, -64  # round up to (4-block)s, neg.
{     lqr       $Rsbox5, Sbox+0x50
{ ai        $Rroundkeys, $Rroundkeys, 0x10  # adjust since lookup before incr
{     lqr       $Rsbox6, Sbox+0x60
{ sf        $Rdat, $Rblock, $Rin_dat  # dataptr+block -> data
{     lqr       $Rsbox7, Sbox+0x70
{ ai        $Rctr, $Rin_ctr, 0  # move CTR (clobber Rin_dat!)
{     biz       $Rin_nb, $lr  # return if no bytes
{ ai        $(Rdat + NR), $Rdat, 0x10  # data ptr for block 1
{     lqr       $Rsbox8, Sbox+0x80
{ a         $(Rctr + NR), $Rin_ctr, $Rincr  # increment CTR for block 1
{     lqr       $Rsbox9, Sbox+0x90
{ ai        $(Rdat + 2*NR), $Rdat, 0x20  # data ptr for block 2
{     lqr       $RsboxA, Sbox+0xA0
{ a         $(Rctr + 2*NR), $(Rctr + NR), $Rincr  # increment CTR for block 2
{     lqr       $RsboxB, Sbox+0xB0
{ ai        $(Rdat + 3*NR), $Rdat, 0x30  # data ptr for block 3
{     lqr       $RsboxC, Sbox+0xC0
{ a         $(Rctr + 3*NR), $(Rctr + 2*NR), $Rincr  # increment CTR for block 3
{     lqr       $RsboxD, Sbox+0xD0
{ rotmi     $Rblkpad, $Rblkpad, -4  # save info on (# blocks) % 4
{     lqr       $RsboxE, Sbox+0xE0
{ shli      $Rincr, $Rincr, 2  # shift incr for 4 blocks
{     lqr       $RsboxF, Sbox+0xF0
{ ilh       $Rmod, 0x1B1B  # 00 -> 1B -> 1B
{     lqd       $Rroundkey0, 0($Rin_key)  # get round key #0
{ ila       $Rhints, BranchHints
{     lqr       $Rshiftrows, ShiftRows
{ ila       $Rhint, Lroundloop
{     lqr       $Rshiftrow1, ShiftRows+0x10
{ sf        $Rhints, $Rnrounds, $Rhints  # hints+round -> round hint
{     lqr       $Rrotrow2, RotRow2
{ ai        $Rround, $Rnrounds, 0  # initialize round counter
{     stqr      $Rdat, SaveReg+0x00  # save data ptr
{ ila       $8, Lroundloop_end + 4  # address not to loop
{     stqr      $(Rdat + NR), SaveReg+0x10  # save data ptr
      stqr      $(Rdat + 2*NR), SaveReg+0x20  # save data ptr
Lhinttabloop:
      stqx      $Rhint, $Rhints, $Rround  # put hint for each round
  ai        $Rround, $Rround, 0x10  # next round (+16)
Lhinttabloop_end:
      brnz      $Rround, Lhinttabloop  # branch if not last round
  .align    3
{ xor       $(Rstate + NR), $(Rctr + NR), $Rroundkey0
{     stqx      $Rhint, $Rhints, $Rround  # put hint for next round loop
{ xor       $Rstate, $Rctr, $Rroundkey0  # add RK0 to CTR
{     stqd      $8, -32($Rhints)  # store hint not to loop
{ xor       $(Rstate + 2*NR), $(Rctr + 2*NR), $Rroundkey0
{     shlqbyi   $Rround, $Rnrounds, 0  # initialize round counter
# ROUND 0 for first set of blocks:
{ xor       $(Rstate + 3*NR), $(Rctr + 3*NR), $Rroundkey0
{     stqr      $(Rdat + 3*NR), SaveReg+0x30  # save data ptr
Lblockloop:
  .align    3
# initialize:
  a         $Rctr, $Rctr, $Rincr  # increment CTR
  a         $(Rctr + NR), $(Rctr + NR), $Rincr
  a         $(Rctr + 2*NR), $(Rctr + 2*NR), $Rincr
  a         $(Rctr + 3*NR), $(Rctr + 3*NR), $Rincr

Lroundloop:
# SIMD version of S-box
  .align    3
{ andbi     $Ridx, $Rstate, 0x1F  # lower 5 bits (0-4) for lookup
{     hbr       Lroundloop_end, $Rhint  # hint for round loop
{ andbi     $(Ridx + NR), $(Rstate + NR), 0x1F
{     lqx       $Rhint, $Rhints, $Rround  # get hint for next round
{ andbi     $(Ridx + 2*NR), $(Rstate + 2*NR), 0x1F
{     shufb     $Rsbox01, $Rsbox0, $Rsbox1, $Ridx  # partial lookup if 3 msb = 000
{ andbi     $(Ridx + 3*NR), $(Rstate + 3*NR), 0x1F
{     shufb     $Rsbox23, $Rsbox2, $Rsbox3, $Ridx  # partial lookup if 3 msb = 001
{ andbi     $Rbit5, $Rstate, 0x20  # get next bit (#5)
{     shufb     $Rsbox45, $Rsbox4, $Rsbox5, $Ridx  # partial lookup if 3 msb = 010
{ andbi     $Rbit6, $Rstate, 0x40  # get next bit (#6)
{     shufb     $Rsbox67, $Rsbox6, $Rsbox7, $Ridx  # partial lookup if 3 msb = 011
{ ceqbi     $Rbit5, $Rbit5, 0x20  # form bytewise selector
{     shufb     $Rsbox89, $Rsbox8, $Rsbox9, $Ridx  # partial lookup if 3 msb = 100
{ ceqbi     $Rbit6, $Rbit6, 0x40  # form bytewise selector
{     shufb     $RsboxAB, $RsboxA, $RsboxB, $Ridx  # partial lookup if 3 msb = 101
{ clgtbi    $Rbit7, $Rstate, 0x7F  # form selector based on msb (#7)
{     shufb     $RsboxCD, $RsboxC, $RsboxD, $Ridx  # partial lookup if 3 msb = 110
{ selb      $Rsbox03, $Rsbox01, $Rsbox23, $Rbit5  # partial lookup if 2 msb = 00
{     shufb     $RsboxEF, $RsboxE, $RsboxF, $Ridx  # partial lookup if 3 msb = 111
{ selb      $Rsbox47, $Rsbox45, $Rsbox67, $Rbit5  # partial lookup if 2 msb = 01
{     shufb     $(Rsbox23 + NR), $Rsbox2, $Rsbox3, $(Ridx + NR)
{ selb      $Rsbox8B, $Rsbox89, $RsboxAB, $Rbit5  # partial lookup if 2 msb = 10
{     shufb     $(Rsbox23 + 2*NR), $Rsbox2, $Rsbox3, $(Ridx + 2*NR)
{ selb      $Rsbox07, $Rsbox03, $Rsbox47, $Rbit6  # partial lookup if 1 msb = 0
{     shufb     $(Rsbox67 + NR), $Rsbox6, $Rsbox7, $(Ridx + NR)
{ selb      $RsboxCF, $RsboxCD, $RsboxEF, $Rbit5  # partial lookup if 2 msb = 11
{     shufb     $(Rsbox89 + 2*NR), $Rsbox8, $Rsbox9, $(Ridx + 2*NR)
{ andbi     $(Rbit5 + NR), $(Rstate + NR), 0x20
{     shufb     $(Rsbox01 + NR), $Rsbox0, $Rsbox1, $(Ridx + NR)
{ selb      $Rsbox8F, $Rsbox8B, $RsboxCF, $Rbit6  # partial lookup if 1 msb = 1
{     shufb     $(RsboxAB + 3*NR), $RsboxA, $RsboxB, $(Ridx + 3*NR)
{ andbi     $(Rbit5 + 2*NR), $(Rstate + 2*NR), 0x20
{     shufb     $(Rsbox01 + 2*NR), $Rsbox0, $Rsbox1, $(Ridx + 2*NR)
{ selb      $Rstate, $Rsbox07, $Rsbox8F, $Rbit7  # finish table lookup
{     shufb     $(RsboxCD + 3*NR), $RsboxC, $RsboxD, $(Ridx + 3*NR)
{ andbi     $(Rbit5 + 3*NR), $(Rstate + 3*NR), 0x20
{     shufb     $(Rsbox01 + 3*NR), $Rsbox0, $Rsbox1, $(Ridx + 3*NR)
{ ceqbi     $(Rbit5 + NR), $(Rbit5 + NR), 0x20
{     shufb     $Rrow1, $Rstate, $Rstate, $Rshiftrow1  # move bytes: row 1
{ ceqbi     $(Rbit5 + 2*NR), $(Rbit5 + 2*NR), 0x20
{     shufb     $Rrow0, $Rstate, $Rstate, $Rshiftrows  # move bytes around: row 0
{ ceqbi     $(Rbit5 + 3*NR), $(Rbit5 + 3*NR), 0x20
{     shufb     $(Rsbox23 + 3*NR), $Rsbox2, $Rsbox3, $(Ridx + 3*NR)
{ andbi     $(Rbit6 + NR), $(Rstate + NR), 0x40
{     shufb     $(Rsbox45 + NR), $Rsbox4, $Rsbox5, $(Ridx + NR)
{ andbi     $(Rbit6 + 2*NR), $(Rstate + 2*NR), 0x40
{     shufb     $(Rsbox45 + 2*NR), $Rsbox4, $Rsbox5, $(Ridx + 2*NR)
{ xor       $Rrow01, $Rrow0, $Rrow1  # (0+1)
{     lqx       $Rroundkey, $Rroundkeys, $Rround  # get round key
{ andbi     $(Rbit6 + 3*NR), $(Rstate + 3*NR), 0x40
{     shufb     $(Rsbox45 + 3*NR), $Rsbox4, $Rsbox5, $(Ridx + 3*NR)
{ ceqbi     $(Rbit6 + NR), $(Rbit6 + NR), 0x40
{     rotqbii   $Rtimes2, $Rrow01, 1  # mul by 2
{ ceqbi     $(Rbit6 + 2*NR), $(Rbit6 + 2*NR), 0x40
{     shufb     $(Rsbox67 + 2*NR), $Rsbox6, $Rsbox7, $(Ridx + 2*NR)
{ ceqbi     $(Rbit6 + 3*NR), $(Rbit6 + 3*NR), 0x40
{     shufb     $(Rsbox67 + 3*NR), $Rsbox6, $Rsbox7, $(Ridx + 3*NR)
{ clgtbi    $(Rbit7 + NR), $(Rstate + NR), 0x7F
{     shufb     $(Rsbox89 + NR), $Rsbox8, $Rsbox9, $(Ridx + NR)
{ clgtbi    $(Rbit7 + 2*NR), $(Rstate + 2*NR), 0x7F
{     gbb       $Rbit7, $Rtimes2  # get lsb (was msb)
{ clgtbi    $(Rbit7 + 3*NR), $(Rstate + 3*NR), 0x7F
{     shufb     $(Rsbox89 + 3*NR), $Rsbox8, $Rsbox9, $(Ridx + 3*NR)
{ selb      $(Rsbox03 + NR), $(Rsbox01 + NR), $(Rsbox23 + NR), $(Rbit5 + NR)
{     shufb     $(RsboxAB + NR), $RsboxA, $RsboxB, $(Ridx + NR)
{ selb      $(Rsbox03 + 2*NR), $(Rsbox01 + 2*NR), $(Rsbox23 + 2*NR), $(Rbit5 + 2*NR)
{     shufb     $(RsboxAB + 2*NR), $RsboxA, $RsboxB, $(Ridx + 2*NR)
{ selb      $(Rsbox03 + 3*NR), $(Rsbox01 + 3*NR), $(Rsbox23 + 3*NR), $(Rbit5 + 3*NR)
{     fsmb      $Rbit7, $Rbit7  # byte selector
{ selb      $(Rsbox47 + NR), $(Rsbox45 + NR), $(Rsbox67 + NR), $(Rbit5 + NR)
{     shufb     $(RsboxCD + NR), $RsboxC, $RsboxD, $(Ridx + NR)
{ selb      $(Rsbox47 + 2*NR), $(Rsbox45 + 2*NR), $(Rsbox67 + 2*NR), $(Rbit5 + 2*NR)
{     shufb     $(RsboxCD + 2*NR), $RsboxC, $RsboxD, $(Ridx + 2*NR)
{ selb      $(Rsbox47 + 3*NR), $(Rsbox45 + 3*NR), $(Rsbox67 + 3*NR), $(Rbit5 + 3*NR)
{     shufb     $(RsboxEF + NR), $RsboxE, $RsboxF, $(Ridx + NR)
{ selb      $(Rsbox8B + NR), $(Rsbox89 + NR), $(RsboxAB + NR), $(Rbit5 + NR)
{     rotqbyi   $Rbit7, $Rbit7, -1  # rot back to source byte
{ selb      $(Rsbox8B + 2*NR), $(Rsbox89 + 2*NR), $(RsboxAB + 2*NR), $(Rbit5 + 2*NR)
{     shufb     $(RsboxEF + 2*NR), $RsboxE, $RsboxF, $(Ridx + 2*NR)
{ selb      $(Rsbox8B + 3*NR), $(Rsbox89 + 3*NR), $(RsboxAB + 3*NR), $(Rbit5 + 3*NR)
{     shufb     $(RsboxEF + 3*NR), $RsboxE, $RsboxF, $(Ridx + 3*NR)
  selb      $(RsboxCF + NR), $(RsboxCD + NR), $(RsboxEF + NR), $(Rbit5 + NR)
  selb      $(Rsbox07 + NR), $(Rsbox03 + NR), $(Rsbox47 + NR), $(Rbit6 + NR)
  selb      $(Rsbox8F + NR), $(Rsbox8B + NR), $(RsboxCF + NR), $(Rbit6 + NR)
  selb      $(RsboxCF + 2*NR), $(RsboxCD + 2*NR), $(RsboxEF + 2*NR), $(Rbit5 + 2*NR)
  selb      $(Rstate + NR), $(Rsbox07 + NR), $(Rsbox8F + NR), $(Rbit7 + NR)
  selb      $(Rsbox07 + 2*NR), $(Rsbox03 + 2*NR), $(Rsbox47 + 2*NR), $(Rbit6 + 2*NR)
  .align    3
{ selb      $(Rsbox8F + 2*NR), $(Rsbox8B + 2*NR), $(RsboxCF + 2*NR), $(Rbit6 + 2*NR)
{     shufb     $(Rrow1 + NR), $(Rstate + NR), $(Rstate + NR), $Rshiftrow1
{ selb      $(RsboxCF + 3*NR), $(RsboxCD + 3*NR), $(RsboxEF + 3*NR), $(Rbit5 + 3*NR)
{     shufb     $(Rrow0 + NR), $(Rstate + NR), $(Rstate + NR), $Rshiftrows
  selb      $(Rstate + 2*NR), $(Rsbox07 + 2*NR), $(Rsbox8F + 2*NR), $(Rbit7 + 2*NR)
  selb      $(Rsbox07 + 3*NR), $(Rsbox03 + 3*NR), $(Rsbox47 + 3*NR), $(Rbit6 + 3*NR)
{ selb      $(Rsbox8F + 3*NR), $(Rsbox8B + 3*NR), $(RsboxCF + 3*NR), $(Rbit6 + 3*NR)
{     shufb     $(Rrow1 + 2*NR), $(Rstate + 2*NR), $(Rstate + 2*NR), $Rshiftrow1
{ xor       $(Rrows + NR), $(Rrow1 + NR), $Rroundkey
{     shufb     $(Rrow0 + 2*NR), $(Rstate + 2*NR), $(Rstate + 2*NR), $Rshiftrows
  selb      $(Rstate + 3*NR), $(Rsbox07 + 3*NR), $(Rsbox8F + 3*NR), $(Rbit7 + 3*NR)
  xor       $(Rrow01 + NR), $(Rrow0 + NR), $(Rrow1 + NR)
{ xor       $(Rrows + 2*NR), $(Rrow1 + 2*NR), $Rroundkey
{     shufb     $(Rrow1 + 3*NR), $(Rstate + 3*NR), $(Rstate + 3*NR), $Rshiftrow1
{ xor       $(Rrow01 + 2*NR), $(Rrow0 + 2*NR), $(Rrow1 + 2*NR)
{     shufb     $(Rrow0 + 3*NR), $(Rstate + 3*NR), $(Rstate + 3*NR), $Rshiftrows


# SIMD version of Shift Rows and Mix Columns and Add Round Key
{ ai        $Rround, $Rround, 0x10  # next round (*16)
{     fsmbi     $Rzero, 0
{ xor       $Rrows, $Rrow1, $Rroundkey  # 1 + RK
{     shufb     $Rrow23, $Rrow01, $Rrow01, $Rrotrow2  # 2+3
  xor       $(Rrows + 3*NR), $(Rrow1 + 3*NR), $Rroundkey
  xor       $(Rrow01 + 3*NR), $(Rrow0 + 3*NR), $(Rrow1 + 3*NR)
{ cgtbi     $(Rbit7 + NR), $(Rrow01 + NR), -1
{     shufb     $(Rrow23 + NR), $(Rrow01 + NR), $(Rrow01 + NR), $Rrotrow2
{ cgtbi     $(Rbit7 + 2*NR), $(Rrow01 + 2*NR), -1
{     shufb     $(Rrow23 + 2*NR), $(Rrow01 + 2*NR), $(Rrow01 + 2*NR), $Rrotrow2
{ cgtbi     $(Rbit7 + 3*NR), $(Rrow01 + 3*NR), -1
{     shufb     $(Rrow23 + 3*NR), $(Rrow01 + 3*NR), $(Rrow01 + 3*NR), $Rrotrow2
{ xor       $Rrows, $Rrows, $Rrow23  # 1+2+3 + RK
{     shufb     $Rbit7, $Rmod, $Rmod, $Rbit7  # 00 -> 1B, FF -> 80
{ xor       $(Rrows + NR), $(Rrows + NR), $(Rrow23 + NR)
{     shlqbii   $(Rtimes2 + NR), $(Rrow01 + NR), 1
{ xor       $(Rrows + 2*NR), $(Rrows + 2*NR), $(Rrow23 + 2*NR)
{     shlqbii   $(Rtimes2 + 2*NR), $(Rrow01 + 2*NR), 1
{ xor       $(Rrows + 3*NR), $(Rrows + 3*NR), $(Rrow23 + 3*NR)
{     shlqbii   $(Rtimes2 + 3*NR), $(Rrow01 + 3*NR), 1
{ andbi     $Rtimes2, $Rtimes2, 0xFE  # clear lsb
{     shufb     $Rbit7, $Rmod, $Rmod, $Rbit7  # 1B -> 1B, 80 -> 00
{ andbi     $(Rtimes2 + NR), $(Rtimes2 + NR), 0xFE
{     shufb     $(Rbit7 + NR), $Rmod, $Rmod, $(Rbit7 + NR)
{ andbi     $(Rtimes2 + 2*NR), $(Rtimes2 + 2*NR), 0xFE
{     shufb     $(Rbit7 + 2*NR), $Rmod, $Rmod, $(Rbit7 + 2*NR)
{ andbi     $(Rtimes2 + 3*NR), $(Rtimes2 + 3*NR), 0xFE
{     shufb     $(Rbit7 + 3*NR), $Rmod, $Rmod, $(Rbit7 + 3*NR)
{ xor       $Rrows, $Rrows, $Rtimes2  # 2*(0+1) + (1+2+3) + RK
{     shufb     $Rbit7, $Rmod, $Rzero, $Rbit7  # 1B -> 00, 00 -> 1B
{ xor       $(Rrows + NR), $(Rrows + NR), $(Rtimes2 + NR)
{     shufb     $(Rbit7 + NR), $Rmod, $Rmod, $(Rbit7 + NR)
{ xor       $(Rrows + 2*NR), $(Rrows + 2*NR), $(Rtimes2 + 2*NR)
{     shufb     $(Rbit7 + 2*NR), $Rmod, $Rmod, $(Rbit7 + 2*NR)
{ xor       $(Rrows + 3*NR), $(Rrows + 3*NR), $(Rtimes2 + 3*NR)
{     shufb     $(Rbit7 + 3*NR), $Rmod, $Rmod, $(Rbit7 + 3*NR)
  xor       $Rstate, $Rrows, $Rbit7  # mod GF poly
  xor       $(Rstate + NR), $(Rrows + NR), $(Rbit7 + NR)
  xor       $(Rstate + 2*NR), $(Rrows + 2*NR), $(Rbit7 + 2*NR)
  .align    3
  xor       $(Rstate + 3*NR), $(Rrows + 3*NR), $(Rbit7 + 3*NR)
Lroundloop_end:
      brnz      $Rround, Lroundloop  # branch if not last round

# LAST ROUND
# SIMD version of S-box
  .align    3
{ andbi     $Ridx, $Rstate, 0x1F  # lower 5 bits (0-4) for lookup
{     hbrr      Lblockloop_end, Lblockloop  # hint for block loop
{ andbi     $(Ridx + NR), $(Rstate + NR), 0x1F
{     lqd       $Rroundkey, 0($Rroundkeys)  # get round key
  andbi     $(Ridx + 2*NR), $(Rstate + 2*NR), 0x1F
  andbi     $(Ridx + 3*NR), $(Rstate + 3*NR), 0x1F
{ andbi     $Rbit5, $Rstate, 0x20  # get next bit (#5)
{     shufb     $Rsbox01, $Rsbox0, $Rsbox1, $Ridx  # partial lookup if 3 msb = 000
{ andbi     $(Rbit5 + NR), $(Rstate + NR), 0x20
{     shufb     $(Rsbox01 + NR), $Rsbox0, $Rsbox1, $(Ridx + NR)
{ andbi     $(Rbit5 + 2*NR), $(Rstate + 2*NR), 0x20
{     shufb     $(Rsbox01 + 2*NR), $Rsbox0, $Rsbox1, $(Ridx + 2*NR)
{ andbi     $(Rbit5 + 3*NR), $(Rstate + 3*NR), 0x20
{     shufb     $(Rsbox01 + 3*NR), $Rsbox0, $Rsbox1, $(Ridx + 3*NR)
{ ceqbi     $Rbit5, $Rbit5, 0x20  # form bytewise selector
{     shufb     $Rsbox23, $Rsbox2, $Rsbox3, $Ridx  # partial lookup if 3 msb = 001
{ ceqbi     $(Rbit5 + NR), $(Rbit5 + NR), 0x20
{     shufb     $(Rsbox23 + NR), $Rsbox2, $Rsbox3, $(Ridx + NR)
{ ceqbi     $(Rbit5 + 2*NR), $(Rbit5 + 2*NR), 0x20
{     shufb     $(Rsbox23 + 2*NR), $Rsbox2, $Rsbox3, $(Ridx + 2*NR)
{ ceqbi     $(Rbit5 + 3*NR), $(Rbit5 + 3*NR), 0x20
{     shufb     $(Rsbox23 + 3*NR), $Rsbox2, $Rsbox3, $(Ridx + 3*NR)
{ andbi     $Rbit6, $Rstate, 0x40  # get next bit (#6)
{     shufb     $Rsbox45, $Rsbox4, $Rsbox5, $Ridx  # partial lookup if 3 msb = 010
{ andbi     $(Rbit6 + NR), $(Rstate + NR), 0x40
{     shufb     $(Rsbox45 + NR), $Rsbox4, $Rsbox5, $(Ridx + NR)
{ andbi     $(Rbit6 + 2*NR), $(Rstate + 2*NR), 0x40
{     shufb     $(Rsbox45 + 2*NR), $Rsbox4, $Rsbox5, $(Ridx + 2*NR)
{ andbi     $(Rbit6 + 3*NR), $(Rstate + 3*NR), 0x40
{     shufb     $(Rsbox45 + 3*NR), $Rsbox4, $Rsbox5, $(Ridx + 3*NR)
{ ceqbi     $Rbit6, $Rbit6, 0x40  # form bytewise selector
{     shufb     $Rsbox67, $Rsbox6, $Rsbox7, $Ridx  # partial lookup if 3 msb = 011
{ ceqbi     $(Rbit6 + NR), $(Rbit6 + NR), 0x40
{     shufb     $(Rsbox67 + NR), $Rsbox6, $Rsbox7, $(Ridx + NR)
{ ceqbi     $(Rbit6 + 2*NR), $(Rbit6 + 2*NR), 0x40
{     shufb     $(Rsbox67 + 2*NR), $Rsbox6, $Rsbox7, $(Ridx + 2*NR)
{ ceqbi     $(Rbit6 + 3*NR), $(Rbit6 + 3*NR), 0x40
{     shufb     $(Rsbox67 + 3*NR), $Rsbox6, $Rsbox7, $(Ridx + 3*NR)
{ clgtbi    $Rbit7, $Rstate, 0x7F  # form selector based on msb (#7)
{     shufb     $Rsbox89, $Rsbox8, $Rsbox9, $Ridx  # partial lookup if 3 msb = 100
{ clgtbi    $(Rbit7 + NR), $(Rstate + NR), 0x7F
{     shufb     $(Rsbox89 + NR), $Rsbox8, $Rsbox9, $(Ridx + NR)
{ clgtbi    $(Rbit7 + 2*NR), $(Rstate + 2*NR), 0x7F
{     shufb     $(Rsbox89 + 2*NR), $Rsbox8, $Rsbox9, $(Ridx + 2*NR)
{ clgtbi    $(Rbit7 + 3*NR), $(Rstate + 3*NR), 0x7F
{     shufb     $(Rsbox89 + 3*NR), $Rsbox8, $Rsbox9, $(Ridx + 3*NR)
{ selb      $Rsbox03, $Rsbox01, $Rsbox23, $Rbit5  # partial lookup if 2 msb = 00
{     shufb     $RsboxAB, $RsboxA, $RsboxB, $Ridx  # partial lookup if 3 msb = 101
{ selb      $(Rsbox03 + NR), $(Rsbox01 + NR), $(Rsbox23 + NR), $(Rbit5 + NR)
{     shufb     $(RsboxAB + NR), $RsboxA, $RsboxB, $(Ridx + NR)
{ selb      $(Rsbox03 + 2*NR), $(Rsbox01 + 2*NR), $(Rsbox23 + 2*NR), $(Rbit5 + 2*NR)
{     shufb     $(RsboxAB + 2*NR), $RsboxA, $RsboxB, $(Ridx + 2*NR)
{ selb      $(Rsbox03 + 3*NR), $(Rsbox01 + 3*NR), $(Rsbox23 + 3*NR), $(Rbit5 + 3*NR)
{     shufb     $(RsboxAB + 3*NR), $RsboxA, $RsboxB, $(Ridx + 3*NR)
{ selb      $Rsbox47, $Rsbox45, $Rsbox67, $Rbit5  # partial lookup if 2 msb = 01
{     shufb     $RsboxCD, $RsboxC, $RsboxD, $Ridx  # partial lookup if 3 msb = 110
{ selb      $(Rsbox47 + NR), $(Rsbox45 + NR), $(Rsbox67 + NR), $(Rbit5 + NR)
{     shufb     $(RsboxCD + NR), $RsboxC, $RsboxD, $(Ridx + NR)
{ selb      $(Rsbox47 + 2*NR), $(Rsbox45 + 2*NR), $(Rsbox67 + 2*NR), $(Rbit5 + 2*NR)
{     shufb     $(RsboxCD + 2*NR), $RsboxC, $RsboxD, $(Ridx + 2*NR)
{ selb      $(Rsbox47 + 3*NR), $(Rsbox45 + 3*NR), $(Rsbox67 + 3*NR), $(Rbit5 + 3*NR)
{     shufb     $(RsboxCD + 3*NR), $RsboxC, $RsboxD, $(Ridx + 3*NR)
{ selb      $Rsbox8B, $Rsbox89, $RsboxAB, $Rbit5  # partial lookup if 2 msb = 10
{     shufb     $RsboxEF, $RsboxE, $RsboxF, $Ridx  # partial lookup if 3 msb = 111
{ selb      $(Rsbox8B + NR), $(Rsbox89 + NR), $(RsboxAB + NR), $(Rbit5 + NR)
{     shufb     $(RsboxEF + NR), $RsboxE, $RsboxF, $(Ridx + NR)
{ selb      $(Rsbox8B + 2*NR), $(Rsbox89 + 2*NR), $(RsboxAB + 2*NR), $(Rbit5 + 2*NR)
{     shufb     $(RsboxEF + 2*NR), $RsboxE, $RsboxF, $(Ridx + 2*NR)
{ selb      $(Rsbox8B + 3*NR), $(Rsbox89 + 3*NR), $(RsboxAB + 3*NR), $(Rbit5 + 3*NR)
{     shufb     $(RsboxEF + 3*NR), $RsboxE, $RsboxF, $(Ridx + 3*NR)
{ selb      $RsboxCF, $RsboxCD, $RsboxEF, $Rbit5  # partial lookup if 2 msb = 11
{     lqr       $Rdat, SaveReg+0x00  # get data ptr
{ selb      $(RsboxCF + NR), $(RsboxCD + NR), $(RsboxEF + NR), $(Rbit5 + NR)
{     lqr       $(Rdat + NR), SaveReg+0x10  # get data ptr
{ selb      $(RsboxCF + 2*NR), $(RsboxCD + 2*NR), $(RsboxEF + 2*NR), $(Rbit5 + 2*NR)
{     lqr       $(Rdat + 2*NR), SaveReg+0x20  # get data ptr
{ selb      $(RsboxCF + 3*NR), $(RsboxCD + 3*NR), $(RsboxEF + 3*NR), $(Rbit5 + 3*NR)
{     lqr       $(Rdat + 3*NR), SaveReg+0x30  # get data ptr
  selb      $Rsbox07, $Rsbox03, $Rsbox47, $Rbit6  # partial lookup if 1 msb = 0
  selb      $(Rsbox07 + NR), $(Rsbox03 + NR), $(Rsbox47 + NR), $(Rbit6 + NR)
  selb      $(Rsbox07 + 2*NR), $(Rsbox03 + 2*NR), $(Rsbox47 + 2*NR), $(Rbit6 + 2*NR)
  selb      $(Rsbox07 + 3*NR), $(Rsbox03 + 3*NR), $(Rsbox47 + 3*NR), $(Rbit6 + 3*NR)
{ selb      $Rsbox8F, $Rsbox8B, $RsboxCF, $Rbit6  # partial lookup if 1 msb = 1
{     lqx       $Rdatablk, $Rdat, $Rblock  # get next block of data
{ selb      $(Rsbox8F + NR), $(Rsbox8B + NR), $(RsboxCF + NR), $(Rbit6 + NR)
{     lqx       $(Rdatablk + NR), $(Rdat + NR), $Rblock
{ selb      $(Rsbox8F + 2*NR), $(Rsbox8B + 2*NR), $(RsboxCF + 2*NR), $(Rbit6 + 2*NR)
{     lqx       $(Rdatablk + 2*NR), $(Rdat + 2*NR), $Rblock
{ selb      $(Rsbox8F + 3*NR), $(Rsbox8B + 3*NR), $(RsboxCF + 3*NR), $(Rbit6 + 3*NR)
{     lqx       $(Rdatablk + 3*NR), $(Rdat + 3*NR), $Rblock
  selb      $Rstate, $Rsbox07, $Rsbox8F, $Rbit7  # finish table lookup
  selb      $(Rstate + NR), $(Rsbox07 + NR), $(Rsbox8F + NR), $(Rbit7 + NR)
  selb      $(Rstate + 2*NR), $(Rsbox07 + 2*NR), $(Rsbox8F + 2*NR), $(Rbit7 + 2*NR)
  .align    3
{ selb      $(Rstate + 3*NR), $(Rsbox07 + 3*NR), $(Rsbox8F + 3*NR), $(Rbit7 + 3*NR)
{     shlqbyi   $Rround, $Rnrounds, 0  # initialize round counter
# SIMD version of shift rows
{ xor       $Rdatablk, $Rdatablk, $Rroundkey  # add RK to data
{     shufb     $Rstate, $Rstate, $Rstate, $Rshiftrows  # move bytes around
{ xor       $(Rdatablk + NR), $(Rdatablk + NR), $Rroundkey
{     shufb     $(Rstate + NR), $(Rstate + NR), $(Rstate + NR), $Rshiftrows
{ xor       $(Rdatablk + 2*NR), $(Rdatablk + 2*NR), $Rroundkey
{     shufb     $(Rstate + 2*NR), $(Rstate + 2*NR), $(Rstate + 2*NR), $Rshiftrows
{ xor       $(Rdatablk + 3*NR), $(Rdatablk + 3*NR), $Rroundkey
{     shufb     $(Rstate + 3*NR), $(Rstate + 3*NR), $(Rstate + 3*NR), $Rshiftrows
# SIMD version of Add Round Key
{ xor       $Rdatablk, $Rdatablk, $Rstate  # now encrypted data
{     shlqbyi   $Rblockout, $Rblock, 0  # copy block counter
  xor       $(Rdatablk + NR), $(Rdatablk + NR), $(Rstate + NR)
  xor       $(Rdatablk + 2*NR), $(Rdatablk + 2*NR), $(Rstate + 2*NR)
  xor       $(Rdatablk + 3*NR), $(Rdatablk + 3*NR), $(Rstate + 3*NR)
# use similar count-up with block counter
  .align    3
{ ai        $Rblock, $Rblock, 0x40  # next block
{     stqx      $Rdatablk, $Rdat, $Rblockout  # overwrite block of data
{ xor       $(Rstate + NR), $(Rctr + NR), $Rroundkey0
{     stqx      $(Rdatablk + NR), $(Rdat + NR), $Rblockout
{ xor       $(Rstate + 2*NR), $(Rctr + 2*NR), $Rroundkey0
{     stqx      $(Rdatablk + 2*NR), $(Rdat + 2*NR), $Rblockout
{ xor       $(Rstate + 3*NR), $(Rctr + 3*NR), $Rroundkey0
{     stqx      $(Rdatablk + 3*NR), $(Rdat + 3*NR), $Rblockout
  xor       $Rstate, $Rctr, $Rroundkey0  # add RK0 to CTR for next block
Lblockloop_end:
      brnz      $Rblock, Lblockloop  # branch if not last block
# be sure to return correct counter for block after last
      rotqmbyi  $Rblkpad, $Rblkpad, -12  # move to rightmost word
  sf        $(Rctr + NR), $Rincr, $Rctr  # back up loop
  ceqi      $2, $Rblkpad, 0
  selb      $Rctr, $(Rctr + NR), $Rctr, $2  # # to pad
  a         $Rout_ctr, $Rctr, $Rblkpad  # now +1 for last block
      bi        $lr  # return
  .ident    "DRC"
D  AES  CBC  Assembly  Code 

Here  is  our  optimized  version  of  the  CBC  code  (called  CBC2).  Since  the  feedback  of  this  cryptographic 
mode  dictates  only  one  block  can  be  done  at  a  time,  the  resulting  code  is  somewhat  readable.  In  particular, 
this  code  shows  our  one-block  optimized  MixColumns  (with  ShiftRows  and  AddRoundKey) . 

There are still some unavoidable data dependency stalls in this code, where an instruction must wait
for the output of a previous one. (Of the instructions used: all pipeline 0 instructions take 2 cycles,
except rotate and shift instructions, which take 4; all pipeline 1 instructions take 4 cycles, except load
and store instructions, which take 6, and branches, which take 1 if correctly hinted or not taken.)

The  format  is  as  in  the  examples  above:  named  registers  begin  $R  and  statement  labels  begin  L;  pipeline 
0  instructions  are  flush  left  while  pipeline  1  instructions  are  indented;  dual-issued  instruction  pairs  are 
indicated  by  braces. 

Note: the no-operation instructions (nop and lnop) serve only to keep the instruction address parity aligned
with the pipeline, so that later instructions can dual-issue. They are themselves dual-issued and therefore do
not affect the timing; they could have been replaced by .align directives.
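
The effect of address parity can be illustrated with a small Python model. It rests on an assumption not spelled out in this appendix: that two instructions issue together only when the first sits at an even word address and uses pipeline 0, and the following one uses pipeline 1. The function dual_issue_pairs and its parameters are hypothetical names, and stalls are ignored.

```python
def dual_issue_pairs(pipes, start_word=0):
    """Return index pairs (i, i+1) that would dual-issue, ignoring stalls.

    pipes: pipeline number (0 or 1) of each instruction in program order.
    start_word: word address (byte address / 4) of the first instruction.
    Assumed rule: a pair issues together only if the first instruction is
    at an even word address and uses pipeline 0, and the second uses
    pipeline 1.
    """
    pairs = []
    i = 0
    while i < len(pipes):
        even_slot = (start_word + i) % 2 == 0
        has_next = i + 1 < len(pipes)
        if even_slot and pipes[i] == 0 and has_next and pipes[i + 1] == 1:
            pairs.append((i, i + 1))
            i += 2
        else:
            i += 1
    return pairs


# A run such as shufb, selb, shufb, selb, shufb starting on an even word:
misaligned = [1, 0, 1, 0, 1]
# The same run preceded by a nop (pipeline 0), which shifts the parity:
padded = [0] + misaligned

assert dual_issue_pairs(misaligned) == []
assert dual_issue_pairs(padded) == [(0, 1), (2, 3), (4, 5)]
```

In this model the padding nop pairs with the first pipeline 1 instruction, so it costs no issue slot of its own, which matches the remark that the no-operation instructions are themselves dual-issued.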

##  AES  function,  CBC  mode,  2008  Dec  14  Sun  16:32:44 
##  with  NEW  improved  version  of  Mix  Columns 
##  (moved  polynomial  add  to  State) 

##  5  input  parameters:  (NO  error  checking) 

##  pointer  to  data  buffer 

##  pointer  to  Round  Key  buffer 

##  number  of  data  blocks  (must  be  compatible  with  length  of  data  buffer) 

##  number  of  rounds  (must  be  compatible  with  length  of  Round  Key  buffer) 

##  initial  value  for  first  data  block 
##  NO  output  parameters 

.file "aes_cbc.s"
.section mydata, "a", @progbits
.align 4
Sbox:
.octa 0x637C777BF26B6FC53001672BFED7AB76
.octa 0xCA82C97DFA5947F0ADD4A2AF9CA472C0
.octa 0xB7FD9326363FF7CC34A5E5F171D83115
.octa 0x04C723C31896059A071280E2EB27B275
.octa 0x09832C1A1B6E5AA0523BD6B329E32F84
.octa 0x53D100ED20FCB15B6ACBBE394A4C58CF
.octa 0xD0EFAAFB434D338545F9027F503C9FA8
.octa 0x51A3408F929D38F5BCB6DA2110FFF3D2
.octa 0xCD0C13EC5F974417C4A77E3D645D1973
.octa 0x60814FDC222A908846EEB814DE5E0BDB
.octa 0xE0323A0A4906245CC2D3AC629195E479
.octa 0xE7C8376D8DD54EA96C56F4EA657AAE08
.octa 0xBA78252E1CA6B4C6E8DD741F4BBD8B8A
.octa 0x703EB5664803F60E613557B986C11D9E
.octa 0xE1F8981169D98E949B1E87E9CE5528DF
.octa 0x8CA1890DBFE6426841992D0FB054BB16
ShiftRows:
.octa 0x00050A0F04090E03080D02070C01060B  # standard (row 0)
.octa 0x050A0F00090E03040D02070801060B0C  # row 1 on top
.octa 0x0A0F00050E0304090207080D060B0C01  # row 2 on top
.octa 0x0F00050A0304090E07080D020B0C0106  # row 3 on top
BranchHints:
.fill 16*4, 4, 0  # for dynamic br. hints

.text
.global aes_cbc
.type aes_cbc, @function


##REGISTER DEFINITIONS##

.set Rin_dat, 3       # 1st param = ptr to block
.set Rin_key, 4       # 2nd param = ptr to keys
.set Rin_nb, 5        # 3rd param = number of blocks
.set Rin_nr, 6        # 4th param = number of rounds
.set Rin_iv, 7        # 5th param = counter initial value
.set Rround, 10       # Round counter
.set Rroundkey, 11
.set Riv, 12          # IV = Initial Value
.set Rstate, 13       # block State
.set Ridx, 14
.set Rblock, 15       # block counter
.set Rbit5, 16
.set Rbit6, 17
.set Rbit7, 18
.set Rsbox01, 19
.set Rsbox23, 20
.set Rsbox45, 21
.set Rsbox67, 22
.set Rsbox89, 23
.set RsboxAB, 24
.set RsboxCD, 25
.set RsboxEF, 26
.set Rsbox03, 27
.set Rsbox47, 28
.set Rsbox8B, 29
.set RsboxCF, 30
.set Rsbox07, 31
.set Rsbox8F, 32
.set Rshiftrows, 33
.set Rshiftrow1, 34
.set Rshiftrow2, 35
.set Rshiftrow3, 36
.set Rrow0, 37
.set Rrow1, 38
.set Rrow2, 39
.set Rrow3, 40
.set Rrow01, 41
.set Rtimes2, 42
.set Rtimes2m, 43
.set Rblockout, 44    # block counter copy
.set Rnextdat, 45     # block counter copy
.set Rhint, 46        # branch hint
.set Rhints, 47       # branch hint table
.set Rroundkey0, 57
.set Rdatablk, 58
.set Rnrounds, 59     # # of Rounds
.set Rdat, 61         # 1st param = ptr to block
.set Rroundkeys, 62   # Keys Ptr (const)
.set Rsbox0, 64       # S-box Table (const)
.set Rsbox1, 65       # S-box Table (const)
.set Rsbox2, 66       # S-box Table (const)
.set Rsbox3, 67       # S-box Table (const)
.set Rsbox4, 68       # S-box Table (const)
.set Rsbox5, 69       # S-box Table (const)
.set Rsbox6, 70       # S-box Table (const)
.set Rsbox7, 71       # S-box Table (const)
.set Rsbox8, 72       # S-box Table (const)
.set Rsbox9, 73       # S-box Table (const)
.set RsboxA, 74       # S-box Table (const)
.set RsboxB, 75       # S-box Table (const)
.set RsboxC, 76       # S-box Table (const)
.set RsboxD, 77       # S-box Table (const)
.set RsboxE, 78       # S-box Table (const)
.set RsboxF, 79       # S-box Table (const)

.align 3
aes_cbc:
# setup so round reg counts up to zero from neg.
# then adjust pointer to roundkeys so sum points to round key
# use similar count-up with block counter

# load tables into registers and do Round #0 for first block
{ shli  $Rnrounds, $Rin_nr, 4  # #rounds*16
{       lqr  $Rsbox7, Sbox+0x70
{ shli  $Rblock, $Rin_nb, 4  # #blocks*16
{       lqr  $Rsbox0, Sbox+0x00
{ ori  $Rdat, $Rin_dat, 0  # move data pointer
{       lqr  $Rsbox1, Sbox+0x10
{ ori  $Rstate, $Rin_iv, 0  # move IV to State
{       lqr  $Rsbox2, Sbox+0x20
{ sfi  $Rnrounds, $Rnrounds, 0x10  # neg. of (#rounds-1)*16 to addr QW
{       lqr  $Rsbox3, Sbox+0x30
{ sfi  $Rblock, $Rblock, 0  # neg. of (#blocks)*16 to addr QW
{       lqr  $Rsbox4, Sbox+0x40
{ sf  $Rroundkeys, $Rnrounds, $Rin_key  # offset: roundkeys+round -> round key
{       lqr  $Rsbox5, Sbox+0x50
{ sf  $Rdat, $Rblock, $Rdat  # offset: dataptr+block -> data
{       lqr  $Rsbox6, Sbox+0x60
{ ori  $Rround, $Rnrounds, 0  # initialize round counter
{       lqx  $Rroundkey0, $Rroundkeys, $Rnrounds  # get round key #0
{ ai  $Rnextdat, $Rdat, 0x10  # data ptr for next round (*16)
{       lqx  $Rdatablk, $Rdat, $Rblock  # get first block of data
{ ila  $Rhints, BranchHints
{       lqr  $Rsbox8, Sbox+0x80
{ ila  $Ridx, Lroundloop_end + 4  # address not to loop
{       lqr  $Rsbox9, Sbox+0x90
{ sf  $Rhints, $Rnrounds, $Rhints  # offset: hints+round -> round hint
{       lqr  $RsboxA, Sbox+0xA0
{ ila  $Rhint, Lroundloop
{       lqr  $RsboxB, Sbox+0xB0
        lqr  $RsboxC, Sbox+0xC0
        lqr  $Rshiftrow1, ShiftRows+0x10
{ xor  $Rstate, $Rstate, $Rdatablk  # add data to current state
{       lqr  $RsboxD, Sbox+0xD0
        lqr  $RsboxE, Sbox+0xE0
        lqr  $RsboxF, Sbox+0xF0
{ xor  $Rstate, $Rstate, $Rroundkey0  # add round key 0 to state
{       lqr  $Rshiftrows, ShiftRows
        lqr  $Rshiftrow2, ShiftRows+0x20
        lqr  $Rshiftrow3, ShiftRows+0x30
Lhinttabloop:
        stqx  $Rhint, $Rhints, $Rround  # put hint for each round
  ai  $Rround, $Rround, 0x10  # next round (*16)
        brnz  $Rround, Lhinttabloop  # branch if not last round
        stqx  $Rhint, $Rhints, $Rround  # put hint for next round loop
        stqd  $Ridx, -16($Rhints)  # store hint not to loop
  ori  $Rround, $Rnrounds, 0  # initialize round counter
.align 3

Lroundloop:  # also top of Block Loop
# SIMD version of S-box
{ andbi  $Ridx, $Rstate, 0x1F  # lower 5 bits for partial lookup
{       lnop
{ ai  $Rround, $Rround, 0x10  # next round (*16)
{       hbr  Lroundloop_end, $Rhint  # hint for round loop
{ andbi  $Rbit5, $Rstate, 0x20  # get next bit (#5)
{       shufb  $Rsbox01, $Rsbox0, $Rsbox1, $Ridx  # partial lookup if 3 msb = 000
{ andbi  $Rbit6, $Rstate, 0x40  # get next bit (#6)
{       shufb  $Rsbox23, $Rsbox2, $Rsbox3, $Ridx  # partial lookup if 3 msb = 001
{ ceqbi  $Rbit5, $Rbit5, 0x20  # form bytewise selector
{       shufb  $Rsbox45, $Rsbox4, $Rsbox5, $Ridx  # partial lookup if 3 msb = 010
{ ceqbi  $Rbit6, $Rbit6, 0x40  # form bytewise selector
{       shufb  $Rsbox67, $Rsbox6, $Rsbox7, $Ridx  # partial lookup if 3 msb = 011
{ clgtbi  $Rbit7, $Rstate, 0x7F  # form selector based on msb (#7)
{       shufb  $Rsbox89, $Rsbox8, $Rsbox9, $Ridx  # partial lookup if 3 msb = 100
{ selb  $Rsbox03, $Rsbox01, $Rsbox23, $Rbit5  # partial lookup if 2 msb = 00
{       shufb  $RsboxAB, $RsboxA, $RsboxB, $Ridx  # partial lookup if 3 msb = 101
{ nop
{       shufb  $RsboxCD, $RsboxC, $RsboxD, $Ridx  # partial lookup if 3 msb = 110
{ selb  $Rsbox47, $Rsbox45, $Rsbox67, $Rbit5  # partial lookup if 2 msb = 01
{       shufb  $RsboxEF, $RsboxE, $RsboxF, $Ridx  # partial lookup if 3 msb = 111
{ selb  $Rsbox8B, $Rsbox89, $RsboxAB, $Rbit5  # partial lookup if 2 msb = 10
{       lqx  $Rroundkey, $Rroundkeys, $Rround  # get round key
{ selb  $RsboxCF, $RsboxCD, $RsboxEF, $Rbit5  # partial lookup if 2 msb = 11
{       lqx  $Rhint, $Rhints, $Rround  # get hint for next round
  selb  $Rsbox07, $Rsbox03, $Rsbox47, $Rbit6  # partial lookup if 1 msb = 0
  selb  $Rsbox8F, $Rsbox8B, $RsboxCF, $Rbit6  # partial lookup if 1 msb = 1
  selb  $Rstate, $Rsbox07, $Rsbox8F, $Rbit7  # finish table lookup
# SIMD version of shift rows
        shufb  $Rrow1, $Rstate, $Rstate, $Rshiftrow1  # move bytes: row 1
        shufb  $Rrow0, $Rstate, $Rstate, $Rshiftrows  # move bytes around: row 0
        shufb  $Rrow2, $Rstate, $Rstate, $Rshiftrow2  # move bytes: row 2
        shufb  $Rrow3, $Rstate, $Rstate, $Rshiftrow3  # move bytes: row 3
# SIMD version of Mix Columns and Add Round Key
  xor  $Rstate, $Rrow1, $Rroundkey  # 1 + RK
  xor  $Rrow01, $Rrow0, $Rrow1  # 0+1
  xor  $Rstate, $Rstate, $Rrow2  # 1+2 + RK

.align 3
{ clgtbi  $Rbit7, $Rrow01, 0x7F  # if msb = 1
{       shlqbii  $Rtimes2, $Rrow01, 1  # shift block 1 bit
  xor  $Rstate, $Rstate, $Rrow3  # 1+2+3 + RK
  xorbi  $Rtimes2m, $Rstate, 0x1B  # mod field polynomial
  andbi  $Rtimes2, $Rtimes2, 0xFE  # clear lsb
  selb  $Rstate, $Rstate, $Rtimes2m, $Rbit7  # now 1+2+3+RK mod poly
.align 3  # not really nec. here
  xor  $Rstate, $Rstate, $Rtimes2  # 2*(0+1) + 1+2+3 + RK, done
Lroundloop_end:
        brnz  $Rround, Lroundloop  # branch if not last round

# LAST ROUND
# SIMD version of S-box
.align 3
{ andbi  $Ridx, $Rstate, 0x1F  # lower 5 bits for partial lookup
{       hbrr  Lblockloop_end, Lroundloop  # hint for block loop
{ ori  $Rblockout, $Rblock, 0  # copy block # for output
{       lqx  $Rdatablk, $Rnextdat, $Rblock  # get next block of data
{ andbi  $Rbit5, $Rstate, 0x20  # get next bit (#5)
{       shufb  $Rsbox01, $Rsbox0, $Rsbox1, $Ridx  # partial lookup if 3 msb = 000
{ andbi  $Rbit6, $Rstate, 0x40  # get next bit (#6)
{       shufb  $Rsbox23, $Rsbox2, $Rsbox3, $Ridx  # partial lookup if 3 msb = 001
{ ceqbi  $Rbit5, $Rbit5, 0x20  # form bytewise selector
{       shufb  $Rsbox45, $Rsbox4, $Rsbox5, $Ridx  # partial lookup if 3 msb = 010
{ ceqbi  $Rbit6, $Rbit6, 0x40  # form bytewise selector
{       shufb  $Rsbox67, $Rsbox6, $Rsbox7, $Ridx  # partial lookup if 3 msb = 011
{ clgtbi  $Rbit7, $Rstate, 0x7F  # form selector based on msb (#7)
{       shufb  $Rsbox89, $Rsbox8, $Rsbox9, $Ridx  # partial lookup if 3 msb = 100
{ selb  $Rsbox03, $Rsbox01, $Rsbox23, $Rbit5  # partial lookup if 2 msb = 00
{       shufb  $RsboxAB, $RsboxA, $RsboxB, $Ridx  # partial lookup if 3 msb = 101
{ ai  $Rblock, $Rblock, 0x10  # next block
{       shufb  $RsboxCD, $RsboxC, $RsboxD, $Ridx  # partial lookup if 3 msb = 110
{ selb  $Rsbox47, $Rsbox45, $Rsbox67, $Rbit5  # partial lookup if 2 msb = 01
{       shufb  $RsboxEF, $RsboxE, $RsboxF, $Ridx  # partial lookup if 3 msb = 111
{ selb  $Rsbox8B, $Rsbox89, $RsboxAB, $Rbit5  # partial lookup if 2 msb = 10
{       lqd  $Rroundkey, 0x10($Rroundkeys)  # get round key
{ selb  $RsboxCF, $RsboxCD, $RsboxEF, $Rbit5  # partial lookup if 2 msb = 11
{       shlqbyi  $Rround, $Rnrounds, 0  # initialize round counter
  selb  $Rsbox07, $Rsbox03, $Rsbox47, $Rbit6  # partial lookup if 1 msb = 0
  selb  $Rsbox8F, $Rsbox8B, $RsboxCF, $Rbit6  # partial lookup if 1 msb = 1
  selb  $Rstate, $Rsbox07, $Rsbox8F, $Rbit7  # finish table lookup
.align 3
# SIMD version of shift rows
  xor  $Rdatablk, $Rdatablk, $Rroundkey0  # add round key 0 to next data
        shufb  $Rstate, $Rstate, $Rstate, $Rshiftrows  # move bytes around
# SIMD version of Add Round Key
  xor  $Rstate, $Rstate, $Rroundkey  # add round key to state
# use similar count-up with block counter
        stqx  $Rstate, $Rdat, $Rblockout  # overwrite block of data
  xor  $Rstate, $Rstate, $Rdatablk  # add data+RK0 to current state
Lblockloop_end:
        brnz  $Rblock, Lroundloop  # branch if not last block
        bi  $lr  # return
.ident "DRC"